Topic Title: Cross-Language Word Representation


Technical Area: Natural Language Processing



Globalization is one of Alibaba's three long-term strategies for the coming five to ten years. To achieve this goal, Alibaba needs to build a platform that can process, analyze, and translate local languages wherever it does business. Multilingual Natural Language Processing (NLP) and Machine Translation (MT) technologies therefore play a key role. Alibaba has extensive and in-depth application scenarios, along with massive multilingual data and corpora, for multilingual NLP and MT technologies.


Basic NLP technologies are playing an increasingly important role in improving both translation quality and the relevance of retrieved results. It is therefore critical to explore novel bilingual, and in many cases multilingual, NLP technologies in the Alibaba ecosystem. Currently, many basic NLP tasks, such as word segmentation and named entity recognition, are completed with monolingual data sets. These tasks perform well on high-resource languages but poorly on low-resource languages because training data is sparse. By incorporating unsupervised criteria and adversarial evaluation, the learning of word embeddings for low-resource languages can benefit from both bilingual and monolingual settings. Transferring monolingual knowledge to low-resource languages is therefore of great value for both research and practical purposes.



We are looking for collaboration on this topic to explore transfer learning, joint learning, or novel algorithms for learning multilingual word embeddings, so that we can make full use of resource-rich languages to solve NLP tasks in resource-poor languages.
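
As one illustration of how resource-rich languages might be reused, the sketch below aligns two independently trained embedding spaces with a small seed dictionary and then looks up translations of words outside that dictionary. It is a minimal sketch, not the method this topic prescribes: the vectors are hypothetical 2-D toy values, the function names (fit_rotation, rotate, translate) are invented for illustration, and the closed-form rotation used here is the two-dimensional, rotation-only special case of orthogonal Procrustes alignment. Real embeddings have hundreds of dimensions and would typically need an SVD-based solution instead.

```python
import math

def fit_rotation(src_vecs, tgt_vecs):
    """Angle of the 2-D rotation that best maps src_vecs onto tgt_vecs
    in the least-squares sense (rotation-only orthogonal Procrustes)."""
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    return math.atan2(cross, dot)

def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)

def cosine(u, v):
    """Cosine similarity of two 2-D vectors (0.0 if either is zero)."""
    den = math.hypot(*u) * math.hypot(*v)
    return (u[0] * v[0] + u[1] * v[1]) / den if den else 0.0

def translate(word, src_emb, tgt_emb, theta):
    """Map a source word into the target space and return its nearest target word."""
    mapped = rotate(src_emb[word], theta)
    return max(tgt_emb, key=lambda w: cosine(mapped, tgt_emb[w]))

# Hypothetical toy embeddings: the target space is the source space rotated by 90 degrees.
src_emb = {"dog": (1.0, 0.0), "cat": (0.8, 0.6), "car": (0.0, 1.0), "bus": (-0.6, 0.8)}
tgt_emb = {"perro": (0.0, 1.0), "gato": (-0.6, 0.8), "coche": (-1.0, 0.0), "autobus": (-0.8, -0.6)}

# Seed dictionary: the only bilingual supervision used to fit the mapping.
seed = [("dog", "perro"), ("cat", "gato")]
theta = fit_rotation([src_emb[s] for s, _ in seed], [tgt_emb[t] for _, t in seed])

# Words outside the seed dictionary are translated through the learned mapping.
held_out = {w: translate(w, src_emb, tgt_emb, theta) for w in ("car", "bus")}
print(held_out)
```

The seed dictionary is the only bilingual signal here; everything else comes from the two monolingual spaces. Unsupervised and adversarial approaches aim to remove even that seed dictionary, which this sketch does not attempt.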


Related Research Topics

1. Subword-level information: Out-of-vocabulary words are common in language modeling, and word meaning can be represented more accurately with auxiliary subword information, e.g. strokes and stems. Existing work mainly focuses on learning subword-level information for high-resource languages.

2. Multi-word expressions: Just as a phrase can express meaning more precisely than a single character, multi-word expressions are important in most languages. Their meaning cannot be derived from standard representations like ba