This article is part of the Academic Alibaba series and is taken from the paper entitled “CoLink: An Unsupervised Framework for User Identity Linkage” by Zexuan Zhong, Yong Cao, Mu Guo, and Zaiqing Nie, first published in 2018 by the Association for the Advancement of Artificial Intelligence. The full paper can be read here.
Many entities have information on multiple knowledge graphs, each one giving a different snapshot of the same entity. Users who want to understand these entities better can gain valuable insights by combining the information from across these graphs.
Existing attempts to identify and link matching entities automatically have so far produced mixed results. Now, researchers from Alibaba AI Labs, Microsoft and the University of Illinois have developed an approach they call CoLink, which matches entities much more accurately and comprehensively than existing systems.
For a system to identify matching entity profiles on two networks, it must compare the entity information (“attributes”) on both platforms and link profiles with matching attributes. Previous systems have performed this task using string similarity functions, which compare strings of text for similarity.
Since entity attributes are formatted differently on different platforms — “ESCAL ENG” on one platform may correspond to “Escalation Engineer” on another — the string similarity functions require an initial set of confirmed matches to train the system. “Unsupervised” approaches, where the system is instructed to collect training data automatically, usually require tailoring to the specific platforms used, and cannot be generalized to work with all platforms.
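To make the limitation concrete, here is a minimal sketch of a string similarity check using Python's standard difflib module. The function names and the 0.8 threshold are illustrative assumptions, not taken from the paper or from any of the systems it compares against.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two case-folded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two attribute values as matching if they are similar enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return string_similarity(a, b) >= threshold


# Identical text (ignoring case) gets a perfect score.
same = string_similarity("Escalation Engineer", "escalation engineer")

# The abbreviated form shares only part of its characters with the full title,
# so it scores lower and falls under the illustrative threshold.
abbreviated = string_similarity("ESCAL ENG", "Escalation Engineer")

print(same, abbreviated, is_match("ESCAL ENG", "Escalation Engineer"))
```

A purely character-based score like this has no way of knowing that the two titles mean the same thing, which is why such functions need confirmed matches or platform-specific tuning to set thresholds and rules.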
CoLink is the first general unsupervised solution. It uses a new framework with two methods of linking profiles and a co-training algorithm to coordinate between them. The first method is a familiar entity-attribute-based method, and the second is a relationship-based method, which identifies candidate pairs based on mutually related entities. Each method independently analyzes profiles and decides whether they should be linked, in an iterative process, while the co-training algorithm uses high-quality matches from both models to retrain them between iterations.
(The CoLink co-training algorithm)
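The co-training loop can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the two scorer classes, the seed pair, the 0.9 threshold, the one-to-one linking constraint, and all profile data are hypothetical stand-ins for the trained models and real networks.

```python
from itertools import product


class AttributeModel:
    """Toy attribute scorer: learns title aliases from already-linked pairs."""

    def __init__(self, attrs_a, attrs_b):
        self.attrs_a, self.attrs_b = attrs_a, attrs_b
        self.aliases = set()

    def fit(self, linked):
        # Remember which title on network A went with which title on network B.
        self.aliases = {(self.attrs_a[a][1], self.attrs_b[b][1]) for a, b in linked}

    def score(self, a, b):
        name_a, title_a = self.attrs_a[a]
        name_b, title_b = self.attrs_b[b]
        title_ok = title_a == title_b or (title_a, title_b) in self.aliases
        return 0.5 * (name_a == name_b) + 0.5 * title_ok


class RelationModel:
    """Toy relationship scorer: counts neighbours that are already linked."""

    def __init__(self, nbrs_a, nbrs_b):
        self.nbrs_a, self.nbrs_b = nbrs_a, nbrs_b
        self.mapping = {}

    def fit(self, linked):
        self.mapping = dict(linked)

    def score(self, a, b):
        na, nb = self.nbrs_a[a], self.nbrs_b[b]
        if not na or not nb:
            return 0.0
        shared = sum(1 for n in na if self.mapping.get(n) in nb)
        return shared / max(len(na), len(nb))


def co_train(ids_a, ids_b, models, seeds, threshold=0.9, max_iters=10):
    """Retrain both models on the shared link set until no new links appear."""
    linked = set(seeds)
    for _ in range(max_iters):
        for model in models:
            model.fit(linked)
        used_a = {a for a, _ in linked}
        used_b = {b for _, b in linked}
        proposals = []
        for model in models:
            for a, b in product(ids_a, ids_b):
                if a in used_a or b in used_b:
                    continue
                s = model.score(a, b)
                if s >= threshold:
                    proposals.append((s, a, b))
        added = False
        # Accept the most confident proposals first, keeping links one-to-one.
        for s, a, b in sorted(proposals, reverse=True):
            if a not in used_a and b not in used_b:
                linked.add((a, b))
                used_a.add(a)
                used_b.add(b)
                added = True
        if not added:
            break
    return linked


# Hypothetical profiles: (name, job title) per profile on each network.
attrs_a = {"a1": ("alice lee", "ESCAL ENG"), "a2": ("bob ray", "ESCAL ENG"),
           "a3": ("c. diaz", "designer")}
attrs_b = {"b1": ("alice lee", "Escalation Engineer"),
           "b2": ("bob ray", "Escalation Engineer"),
           "b3": ("carol diaz", "designer")}
nbrs_a = {"a1": {"a3"}, "a2": {"a3"}, "a3": {"a1", "a2"}}
nbrs_b = {"b1": {"b3"}, "b2": {"b3"}, "b3": {"b1", "b2"}}

models = [AttributeModel(attrs_a, attrs_b), RelationModel(nbrs_a, nbrs_b)]
links = co_train(list(attrs_a), list(attrs_b), models, seeds={("a1", "b1")})
print(sorted(links))
```

In this toy run, the attribute scorer learns the title alias from the seed pair and links the second profile, which in turn gives the relationship scorer enough linked neighbours to link the third profile, whose names differ. Neither scorer would find all three links alone, which is the intuition behind coordinating the two methods.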
Instead of a string similarity function, CoLink’s entity-attribute-based method uses a machine translation algorithm to match entity attributes. This approach identifies matching attributes more successfully, and can even identify “implicit connections” — attributes that match because they contain the same information, but bear little or no textual resemblance to each other.
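The idea of treating attribute matching as translation can be illustrated with a much simpler technique than a full machine translation system. The sketch below swaps in a simplified IBM Model 1 style word translation table (without a NULL token), trained by expectation-maximization on a few hypothetical attribute pairs. It is not the model used in the paper, and the training pairs, function names, and iteration count are all assumptions made for illustration.

```python
def train_translation_table(pairs, iters=10):
    """Estimate t[(target_word, source_word)] with simplified IBM Model 1 EM."""
    tokenized = [(src.lower().split(), tgt.lower().split()) for src, tgt in pairs]
    # Start from a uniform (unnormalized) table over co-occurring word pairs.
    table = {(w, s): 1.0 for src, tgt in tokenized for w in tgt for s in src}
    for _ in range(iters):
        count, total = {}, {}
        for src, tgt in tokenized:
            for w in tgt:
                norm = sum(table[(w, s)] for s in src)
                for s in src:
                    frac = table[(w, s)] / norm  # expected alignment weight
                    count[(w, s)] = count.get((w, s), 0.0) + frac
                    total[s] = total.get(s, 0.0) + frac
        # Re-normalize so each source word's translations sum to 1.
        table = {(w, s): c / total[s] for (w, s), c in count.items()}
    return table


def translation_score(table, src, tgt):
    """Score how plausibly attribute `tgt` is a translation of attribute `src`."""
    src_tokens, tgt_tokens = src.lower().split(), tgt.lower().split()
    score = 1.0
    for w in tgt_tokens:
        score *= sum(table.get((w, s), 0.0) for s in src_tokens) / len(src_tokens)
    return score


# Hypothetical attribute pairs taken from profiles already believed to match.
training_pairs = [
    ("ESCAL ENG", "Escalation Engineer"),
    ("SR ENG", "Senior Engineer"),
    ("ESCAL MGR", "Escalation Manager"),
]
table = train_translation_table(training_pairs)

good = translation_score(table, "ESCAL ENG", "Escalation Engineer")
bad = translation_score(table, "ESCAL ENG", "Senior Manager")
print(good > bad)
```

Because the table is learned from co-occurrence across linked profiles rather than from shared characters, the same mechanism could in principle associate attribute values with no textual overlap at all, which is the kind of implicit connection the article describes.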
The CoLink framework was tested against other unsupervised approaches, including SiGMa and Alias-disamb. The following table shows the results, with the overall performance calculated as the average of the accuracy and completeness of the match results.
CoLink yielded impressive results, outperforming the next-best approach by as much as 20%. These results show that CoLink offers a substantially more accurate and comprehensive method of linking entities across multiple knowledge graphs than was previously possible.