Theme Title: Efficient Large Scale Graph Embedding with Applications in Cognitive Computing
Large Scale Graphical Model
Building on the successes of the past decades, cognitive computing based on large scale graphical modeling has been deployed in many real applications outside of academia. IBM’s Jeopardy-winning system Watson, Apple’s Siri, Google’s Knowledge Graph, and Facebook Graph Search would not have been possible without the advances that researchers have made over this period. At Taobao, buyers, sellers, and commodities form a huge heterogeneous graph, and many business problems can be modeled as graph analytic problems. For example, Guess Your Likes can be abstracted as link prediction, and spammer detection as node classification.
However, we should not forget that although data is the foundation underpinning information processing, the real value lies in the information we are able to surface by processing the data, in the patterns we are able to identify and recognize, and, ultimately, in the knowledge we extract. In other words, from Big Data we usually distill Small Knowledge, which has Big Value. Representing knowledge in a machine-processable format is a well-known problem in Artificial Intelligence that has been studied extensively since the 1970s. Approaches to knowledge representation based on graphs (mathematical structures consisting of sets of nodes, or vertices, which may be connected by edges) have a long history, from the introduction of conceptual graphs to the more recent Linked Data initiative — a method for publishing data and knowledge over the Web that explicitly represents their relationships, thus enabling computers to directly access and semantically query such distributed knowledge graphs.
Graphical models are one possible approach to building large knowledge bases: structured collections of facts about the world that computer systems can use to reason, and to interact with humans more naturally. These are some of the key characteristics of cognitive computing. The potential impact of this new computing approach is very high and, combined with other technologies currently under development, it promises to open up whole new ways for humans to use computers. More generally, graphical modeling can enable scientists to process their big data and, critically, to extract knowledge from the reams of data they collect. Advances in the analytics that drive markets, personalization, and manufacturing will add knowledge and reasoning to the models that engineers use today. We believe that large scale graphical modeling will enable further scientific advances across big-data-related areas.
Large scale graphical modeling will play a key role in meeting many of the challenges that industry and academia face in their data- and knowledge-driven technologies over the coming decade. Extracting knowledge from data, creating new knowledge-driven applications, and generating new expressive knowledge will likely lead to advances in countless areas. Almost any domain that processes data into knowledge will benefit from advances in large scale graphical modeling, e.g., biomedicine, health care and life sciences, the oil and gas industry and sustainable energy, engineering, earth and environmental sciences, autonomous robotics, education, digital humanities, social sciences, finance, and geosciences, among many others.
We invite researchers who are either experts in, or keenly aware of, the challenges and opportunities their fields bring to graphical modeling to work on a new deep-graph framework that seamlessly integrates computing (with visualization), learning, and inference. Such a framework allows us to stay focused on the real-world problems and use cases that graphical modeling as a whole can and should help solve under globally optimal settings.
However, we observe that the current development of large scale graphical modeling is relatively scattered: for example, learning methodology and inference infrastructure are not optimized jointly, because researchers from different areas have different focuses. As a result, the framework as a whole remains far from its optimal setting.
We propose a principled effort to investigate a new deep-graph framework that seamlessly integrates computing (with visualization), learning, and inference in a consistent and optimal setup. This proposal represents a paradigm shift in large scale graphical learning: from the local optimum of one phase of the system to the global optimum of a framework in which computing (with visualization), learning, and inference are treated as a whole.
Related Research Topics
We are interested in, but not limited to, the following topics:
- Attributed Network Embedding based Fraud Detection
- Large Scale Knowledge Graph based Cognitive Computing and its Inference
- Multi-level Heterogeneous Embedding Propagation
Recently, methods that represent networks in a vector space while preserving their properties have become widely popular. The resulting embeddings serve as input features to a downstream model, whose parameters are learned from the training data.
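To make concrete how embeddings serve as model inputs, here is a minimal sketch in which synthetic embeddings and labels stand in for real ones: pre-computed node embeddings are used as plain feature vectors for a logistic-regression node classifier trained by gradient descent. All names and sizes are illustrative, not part of the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: pretend we already have d-dimensional embeddings for n nodes
# (in practice these would come from a network-embedding method).
n, d = 200, 16
Z = rng.normal(size=(n, d))            # node embeddings (synthetic here)
w_true = rng.normal(size=d)
y = (Z @ w_true > 0).astype(float)     # synthetic binary node labels

# The embeddings are consumed as ordinary feature vectors: here,
# logistic regression for node classification via gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))  # predicted probabilities
    w -= 0.1 * Z.T @ (p - y) / n        # gradient step on the log-loss

acc = ((Z @ w > 0).astype(float) == y).mean()
```

The same pattern applies to link prediction: there, the feature vector for a candidate edge is typically built from the two endpoint embeddings (e.g., their elementwise product).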
Obtaining vector representations of the heterogeneous nodes of such a graph is inherently difficult and poses several challenges, from which related research topics may arise:
- Heterogeneity: A “good” representation of nodes should preserve not only the structure of the graph but also the differences between entity types. Existing work has focused mainly on vertex embedding of homogeneous graphs.
- Scalability: The graph formed by buyers, sellers, and commodities contains billions of nodes and edges. Defining a scalable model is challenging, especially when the model aims to preserve global properties of the graph.
- Reliability: Observed edges and node labels may be polluted and thus cannot be fully trusted.
To address these three challenges, one possible approach is to decompose the total loss into three parts: a structure loss, an attribute loss, and a meta-path loss.
- Structure Loss: observed labels are partial and not always reliable but they describe the topological structure of the underlying graph;
- Attribute Loss: nodes are often accompanied by a rich set of attributes or features, and modeling node-attribute proximity alongside the network structure can be helpful;
- Meta Path Loss: meta paths (typed paths such as buyer → commodity → seller) account for the heterogeneity of the graph.
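The decomposition above can be sketched on a toy graph as follows. All quantities are synthetic, and the particular loss forms (log-sigmoid edge scores, mean-squared attribute reconstruction) are one plausible choice for illustration, not the proposal's definitive formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, a = 6, 4, 3                  # nodes, embedding dim, attribute dim
Z = rng.normal(size=(n, d))        # node embeddings (to be learned)
A = rng.normal(size=(n, a))        # observed node attributes (synthetic)
W = rng.normal(size=(d, a))        # projection from embeddings to attributes

edges = [(0, 1), (1, 2), (3, 4)]   # observed (possibly noisy) edges
# Hypothetical meta-path instances, e.g. buyer -> commodity -> seller:
meta_paths = [(0, 2), (3, 5)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Structure loss: observed edges should score high under the embeddings.
l_struct = -sum(np.log(sigmoid(Z[u] @ Z[v])) for u, v in edges)

# Attribute loss: embeddings should reconstruct node attributes.
l_attr = np.mean((Z @ W - A) ** 2)

# Meta-path loss: endpoints of a meta-path instance should also be close.
l_meta = -sum(np.log(sigmoid(Z[u] @ Z[v])) for u, v in meta_paths)

total = l_struct + l_attr + l_meta  # a weighted sum in practice
```

In practice each term would carry a tunable weight, and all of `Z` and `W` would be learned jointly by minimizing `total` with stochastic gradients.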
Minimizing the total loss consisting of these three parts, and learning the corresponding parameters, can be very challenging on an extremely large graph. In practice, the computational complexity needs to be close to O(log n), where n is the number of nodes and edges, because our graph typically contains billions of nodes and hundreds of billions of edges.
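Hierarchical softmax is one standard way to reach O(log n) cost per update; negative sampling, sketched below, is a related standard trick whose per-edge update cost is independent of n altogether. This is an illustrative sketch with hypothetical sizes and learning rate, not the proposal's training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, k = 10_000, 32, 5             # nodes, embedding dim, negatives per edge
Z = 0.1 * rng.normal(size=(n, d))   # embeddings being trained

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(u, v, lr=0.025):
    """One negative-sampling SGD update for an observed edge (u, v).

    Only 2 + k rows of Z are touched, so the cost per edge is O(k * d),
    independent of the total number of nodes n.
    """
    zu, zv = Z[u].copy(), Z[v].copy()
    # Positive pair: pull the two endpoint embeddings together.
    g = 1.0 - sigmoid(zu @ zv)
    Z[u] += lr * g * zv
    Z[v] += lr * g * zu
    # Negative pairs: push Z[u] away from k randomly drawn nodes
    # (a noise-distribution sampler would be used in practice).
    for w in rng.integers(0, n, size=k):
        zu_cur, zw = Z[u].copy(), Z[w].copy()
        g = -sigmoid(zu_cur @ zw)
        Z[u] += lr * g * zw
        Z[w] += lr * g * zu_cur

sgd_step(3, 7)
```

Because each update touches a constant number of embedding rows, the overall training cost scales with the number of observed edges rather than with all node pairs, which is what makes this family of methods viable at the graph sizes discussed above.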