Theme title: Adversarial Intelligence

 

Technical Area: Information Retrieval, Recommender System, Spammer Detection, Item-to-Item Similarity, GAN, Multiview Learning, Graph Mining, Geometric Deep Learning

 

Background

Deep learning (DL) and artificial intelligence (AI) have been the supernovas of the past decade and have achieved superior results in many domains: natural language processing, pattern recognition, computer vision, knowledge graphs, e-commerce marketing, spammer detection and so forth. However, on e-commerce platforms, the big bang of DL and AI also empowers cyber-attackers and their adversarial models to attack algorithms and poison training datasets, making spamming and malicious behavior more changeable and unpredictable. All in all, it is the best of times, and it is the worst of times. There is an urgent need for a systematic framework with both shields and spears to address these novel problems on e-commerce platforms.

 

Shields: The Large-Scale Multiview Geometric Deep Learning on Graphs

Graph mining is a fundamental tool for handling graph data in a multitude of fields inside and outside academia, such as biology, network science, recommender systems, fraud detection and more. Taking Taobao as an example, buyers, commodities (items), and sellers are the basic “node” elements used to construct large-scale heterogeneous networks. By applying various graph mining models, we can provide solutions to different business problems: recommendation can be modeled as a link prediction problem, click farming and malicious behavior detection can be abstracted as node classification, and crowdturfing detection can be addressed by edge classification and node clustering.
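As a minimal sketch of the "recommendation as link prediction" framing above, the following scores a candidate buyer-item link by counting short paths through other buyers. The toy purchase log, the function names and the path-counting heuristic are all illustrative assumptions, not a method described in this document.

```python
from collections import defaultdict

# Hypothetical toy purchase log: (buyer, item) edges of a bipartite graph.
edges = [
    ("b1", "i1"), ("b1", "i2"),
    ("b2", "i1"), ("b2", "i2"), ("b2", "i3"),
    ("b3", "i3"),
]


def build_adjacency(edges):
    """Return buyer -> items and item -> buyers adjacency maps."""
    items_of = defaultdict(set)
    buyers_of = defaultdict(set)
    for buyer, item in edges:
        items_of[buyer].add(item)
        buyers_of[item].add(buyer)
    return items_of, buyers_of


def link_score(buyer, item, items_of, buyers_of):
    """Count paths buyer -> seen item -> other buyer -> candidate item."""
    score = 0
    for seen_item in items_of[buyer]:
        for other_buyer in buyers_of[seen_item]:
            if other_buyer != buyer and item in items_of[other_buyer]:
                score += 1
    return score


def recommend(buyer, items_of, buyers_of, k=2):
    """Rank items the buyer has not interacted with by link score."""
    candidates = set(buyers_of) - items_of[buyer]
    ranked = sorted(
        candidates,
        key=lambda it: (-link_score(buyer, it, items_of, buyers_of), it),
    )
    return ranked[:k]


items_of, buyers_of = build_adjacency(edges)
top_items = recommend("b1", items_of, buyers_of)
```

Here the only unseen item for `b1` is `i3`, reached through `b2` via both `i1` and `i2`. A learned model would replace `link_score`, but the task shape (score missing edges, rank, recommend) stays the same.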

 

On the other hand, multi-view data is widespread in real-world problems. For example, at Taobao, the detail page of each item contains text descriptions and images, which describe the properties of the item from different aspects. Multi-view learning aims to integrate the information and knowledge from different scopes and sources to better serve downstream applications such as item profiling, similarity calculation and clustering. For example, an automatic summary of a commodity can be generated by fusing its text descriptions and images.
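To make the idea of combining views concrete, here is a minimal sketch of late fusion for item similarity, assuming each item has a precomputed image vector and a set of title tokens. The field names, the toy items and the fixed weight are hypothetical; a real multi-view model would learn the combination rather than fix it by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fused_similarity(item_a, item_b, image_weight=0.5):
    """Weighted sum of the image-view and text-view similarities."""
    image_sim = cosine(item_a["image_vec"], item_b["image_vec"])
    text_sim = jaccard(item_a["title_tokens"], item_b["title_tokens"])
    return image_weight * image_sim + (1 - image_weight) * text_sim


# Hypothetical items with one image view and one text view each.
red_dress = {"image_vec": [1.0, 0.0], "title_tokens": {"red", "summer", "dress"}}
red_skirt = {"image_vec": [1.0, 0.0], "title_tokens": {"red", "summer", "skirt"}}
phone_case = {"image_vec": [0.0, 1.0], "title_tokens": {"phone", "case"}}

similar_pair = fused_similarity(red_dress, red_skirt)
unrelated_pair = fused_similarity(red_dress, phone_case)
```

The two red garments agree fully in the image view and partly in the text view, so their fused score is high, while the dress and the phone case agree in neither view.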

 

More generally, a graph can be heterogeneous, meaning that nodes and edges can be of different types. In addition to the graph structure, nodes and links also carry their own attributes, which many earlier learning algorithms omitted. The advantages of multi-view learning can therefore be amplified on graph data with the help of deep learning techniques on graphs (known as Geometric Deep Learning). By preserving node information, edge information and graph structure information through deep learning and multi-view learning, this integration is expected to improve the robustness and precision of existing graph mining models and to investigate graph data effectively from multiple aspects, resulting in better performance in scenarios where the features come from different sources: recommender systems, buyer behavior prediction, spammer detection, crowdturfing identification, fraudulent complaint discovery, click farming detection, and document classification, among many others. For example, in spammer detection on the Taobao platform, a systematic, integrated multi-view geometric deep learning approach is eagerly awaited, one that deeply combines multi-view learning and deep learning within graph mining to recall the vast majority of crowdturfing and malicious behaviors from the existing, sparsely distributed suspicious activities, together with users’ profile information, commodities’ information, and their relationships (such as searching, clicking, collecting, adding to cart, trading and so on).

 

Spears: Adversarial Intelligence

With the development of deep-learning-based algorithms and platforms in the past decade, a new era of artificial intelligence (AI) is dawning. AI has been successfully applied in different domains, such as computer vision, natural language processing, recommender systems, and so forth. However, AI also empowers attackers and their adversarial models to attack machine learning algorithms and poison training datasets. On the Taobao platform, we also encounter the aforementioned attacks, since information retrieval (IR) and recommender systems (RS) are widely applied in the e-commerce market to promote online sales.

 

However, state-of-the-art spammer detection methods are either based on already-known patterns or trained on labeled data. The behaviors they flag are then filtered out of the training data of the downstream IR or RS models. These detection models may be ineffective when they encounter spamming behaviors with new patterns.

 

In order to better protect IR and RS, we aim to identify new patterns with machine learning algorithms. Generative models are one possible solution to this problem: a generated sample may represent a new spamming pattern if it can confuse the classifier (the spammer detector).

 

For example, in the training process of a Generative Adversarial Network (GAN), the exploration of new patterns and the strengthening of the classifier are accomplished simultaneously. Additionally, a generative model can not only improve the spam detector but also boost other machine learning algorithms, for instance, by adding generated samples to the training data.
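For reference, the simultaneous training described above corresponds to the standard GAN minimax objective from the original GAN formulation, written here in its usual notation. Reading the discriminator D as the spammer detector and G(z) as a generated behavior sample is an interpretive assumption for this setting, not something the objective itself specifies.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The inner maximization strengthens the classifier, while the outer minimization pushes the generator toward samples the classifier cannot yet tell apart from real data.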

 

We invite professional researchers who are keenly aware of these challenges and opportunities, and whose fields of interest are related to adversarial intelligence, multi-view graph mining and geometric deep learning, to construct a new systematic adversarial intelligence system. The system should combine a novel large-scale multi-view geometric deep learning framework, which integrates deep learning and graph mining in a multi-view manner, with adversarial intelligence techniques that boost both spammer detection and machine learning algorithms for information retrieval or recommender systems.

 

Targets

The “shields” part enables the platform to improve recall and to precisely capture the dynamic patterns of spammers, click farming, fake reviewers, malicious behaviors and so on. Here, we propose a framework that seamlessly integrates multi-view learning, deep learning and graph models as a whole to provide a holistic view of large-scale real-world network problems.

 

The “spears” part provides attack and defense techniques that automatically generate online users’ spamming behaviors for spammer detection and simulate training samples that may bring extra vitality to the recommender system. Here, we propose a principled effort to investigate adversarial intelligence’s applications from different angles, in either enhancing spam detection or boosting other algorithms.

 

Related Research Topics:

Shields: The Large-Scale Multiview Geometric Deep Learning on Graphs

 

Currently, graph embedding is among the most popular techniques in graph mining; it projects nodes into a hidden lower-dimensional vector space while preserving the relationships between them. The embeddings represent the network and can serve as input features for downstream graph mining tasks. Many recent works concentrate on linearizing the graph through random walk techniques and multi-layer networks. Some emerging works have started to consider graph convolutional networks, which propagate information among the nodes of the graph. Still, there are many open questions and challenges in this area:

1. From Single-View to Multi-View. In real-world networks, multiple definitions of proximity are possible. For example, the proximity of two items can be described by the similarity of their images, indicating color or style similarity between the items; or it can be extracted from the buyer-item relationship, indicating the similarity of items in terms of consumer attraction. Each proximity definition reflects a different source and aspect, that is, a different view of the network under study, and a collection of such networks is referred to as a “multi-view” network. Although most existing works focus on networks with a single view, how they can be extended to multi-view proximities remains to be explored.

 

2. Sparsity Problem. In some graph mining problems, such as item-to-item similarity estimation and spammer detection, the graph structure is usually sparse, and there may even be disconnected components or isolated nodes. Current graph embedding works rarely address the sparsity problem.

 

3. Cold Start Problem. A common problem in recommender systems and spam detection is cold start, i.e., the platform contains unseen nodes or has not yet acquired enough ratings and information to generate reliable recommendations or predict spam probabilities. Most existing graph embedding works, which either linearize the graph or apply random-walk-type algorithms, cannot generalize to unseen nodes, i.e., the entire graph embedding needs to be retrained when an unseen node is added. This is not an optimal solution for large-scale networks.

 

4. Scalability Problem and Real-Time Requirement. In e-commerce, these are two essential issues in deploying a graph model. The graph usually contains hundreds of millions or even billions of nodes and edges. Existing graph embedding works aim to represent and preserve node information and network structure, and introduce too many parameters to scale to the large graphs found in e-commerce scenarios. Meanwhile, real-time and streaming graph processing is also critical to e-commerce. How can the learned graph embeddings be updated efficiently and incrementally in real time? How should outdated and irrelevant information be handled? How can edge timing be incorporated? These questions are challenging and need to be properly addressed.
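To illustrate the cold start item in the list above, the sketch below embeds an unseen node by averaging the embeddings of its already-embedded neighbors instead of retraining the whole graph. This is only in the spirit of inductive neighborhood-aggregation approaches, with no learned weights; the function name, the toy embeddings and the zero-vector fallback for isolated nodes are assumptions made for illustration.

```python
def mean_neighbor_embedding(neighbors, embeddings, dim):
    """Embed an unseen node as the mean of its known neighbors' vectors.

    Neighbors without an embedding are skipped; an isolated node (no known
    neighbors) falls back to the zero vector.
    """
    known = [embeddings[n] for n in neighbors if n in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Hypothetical pretrained item embeddings (2-dimensional for readability).
embeddings = {
    "i1": [1.0, 0.0],
    "i2": [0.0, 1.0],
    "i3": [1.0, 1.0],
}

# A new buyer who clicked two known items and one item with no embedding yet.
new_buyer_vec = mean_neighbor_embedding(["i1", "i3", "i_unknown"], embeddings, dim=2)

# A new buyer with no known neighbors, as in the sparsity item above.
isolated_vec = mean_neighbor_embedding([], embeddings, dim=2)
```

The design choice is that the cost of adding a node depends only on its neighborhood size, not on the size of the graph, which is what makes this style of approach attractive at e-commerce scale.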

 

Spears: Adversarial Intelligence

Currently, obtaining a vector representation of heterogeneous nodes of such a graph is inherently difficult and poses several challenges where related research topics may arise:

1. Generative Adversarial Network

The aforementioned GAN has recently drawn attention in computer vision, natural language processing and even graph representation learning. A GAN can mimic a target distribution and generate samples that confuse the discriminator.

 

2. Anomaly Detection

Currently, there are two kinds of anomaly detection methods: supervised and unsupervised. Most real-world problems require detecting anomalous online behaviors without anomaly labels, i.e., they are unsupervised. However, most unsupervised methods, such as detecting dense blocks in behavior tensors, are based on known patterns and cannot avoid the curse of dimensionality. Discovering unknown patterns hidden in hundreds of thousands of features and billions of samples remains an unsolved problem.

 

3. Information Retrieval

Though some works incorporate GANs into the training process of information retrieval models and outperform existing models, the combination of these two fields is worth exploring further, for instance, to generate proper negative training samples that address the non-random-missing-data problem in e-commerce.

 

4. Recommender System

Adversarial models and robust recommender systems were proposed in the early years of the 21st century. However, early research mostly focused on explicit feedback data such as movie ratings. Moreover, few works improve the robustness of complicated recommender system models by revising the models themselves. On the Taobao platform, the recommender system is built upon online behaviors such as clicks or payments, which are implicit feedback and much easier to fake. Therefore, it is all the more important and urgent for Taobao to identify spamming clicks and payments, or to improve the robustness of the recommender system so that it is immune to these spamming attacks.
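As a point of comparison for the negative-sample generation mentioned under Information Retrieval above, the sketch below shows the simplest baseline: drawing negatives uniformly from items a user has not interacted with. It is a swapped-in baseline, not a GAN; the function name, catalog and click set are hypothetical. A learned generator would replace `sample_negatives`.

```python
import random


def sample_negatives(observed, catalog, n, rng):
    """Uniformly sample up to n items the user has not interacted with."""
    unobserved = sorted(set(catalog) - set(observed))
    return rng.sample(unobserved, min(n, len(unobserved)))


# Hypothetical catalog and one user's observed (implicit-feedback) clicks.
catalog = ["i1", "i2", "i3", "i4", "i5"]
clicked = {"i1", "i3"}

rng = random.Random(0)  # seeded so the sketch is reproducible
negatives = sample_negatives(clicked, catalog, n=2, rng=rng)
```

This baseline treats every unobserved item as equally likely to be a true negative, which is exactly the missing-at-random assumption that the non-random-missing-data problem calls into question.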