Topic Title: Neural Machine Translation for Low-Resource Languages


Technical Area: Natural Language Processing



Globalization is one of Alibaba's three long-term strategies for the coming five to ten years. To achieve this goal, Alibaba needs to build a platform that can process, analyze, and translate local languages wherever it does business. Multilingual Natural Language Processing (NLP) and Machine Translation (MT) technologies therefore play a key role. Alibaba offers extensive, in-depth application scenarios and massive multilingual data and corpora for multilingual NLP and MT technologies.


Deep learning has made remarkable progress in recent years and has had a profound influence on NLP and MT technologies. Since Neural Machine Translation (NMT) was proposed in 2014, it has become the mainstream MT technology over the past three years; its translation quality is better than that of traditional Statistical Machine Translation (SMT) in most languages and scenarios. Although NMT has made significant progress, some problems still need improvement or remain unsolved, and they have a big impact on the practical application of NLP and MT technologies.


Training a practical NMT system usually requires large amounts of parallel corpora, but most language pairs in the world have little or no bilingual parallel data. Therefore, in addition to collecting more parallel corpora, it is of great practical value to study how to train a practical NMT system with monolingual corpora or limited resources. Moreover, transferring knowledge from high-resource languages to low-resource languages would also help to address the shortage of corpora and improve translation quality. It is also worth noting that bilingual dictionaries have a great influence on translation quality between two languages, especially in specific domains such as e-commerce and science. However, zero-resource languages often lack bilingual dictionaries, so many terms cannot be translated accurately.



We are looking for collaboration on this topic to:


Related Research Topics

The related research topics include zero-shot neural machine translation, transfer learning, unsupervised neural machine translation, and bilingual lexicon induction. The current zero-shot neural machine translation method [1] involves a pivot language to accomplish translation between two languages without parallel data. For example, if we have parallel data for Chinese-English and English-Spanish but none for Chinese-Spanish, we train a single model to translate from Chinese to English and from English to Spanish; through shared word embeddings and model parameters, the same model can also translate from Chinese to Spanish. Transfer learning methods [2] first train a model on a high-resource language pair, then transfer some of the learned parameters to the low-resource pair to initialize training, thereby improving the performance of low-resource machine translation. Unsupervised neural machine translation [3,4] first trains word embeddings on monolingual data in each of the two languages and then maps them into the same latent space. By learning to reconstruct sentences in both languages from this shared feature space, using a combination of denoising and back-translation, the model effectively learns to translate without any parallel data.
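
The zero-shot setup described above can be sketched as a data-preparation step. In the system of Johnson et al. [1], a single shared model is told which language to produce by an artificial token prepended to the source sentence. The token format `<2xx>`, the function names, and the toy sentences below are illustrative assumptions, not code from that system, and the model itself is omitted.

```python
from typing import Iterable, List, Tuple

# One training example: (source language, target language, source text, target text).
Example = Tuple[str, str, str, str]


def tag_source(text: str, tgt_lang: str) -> str:
    """Prepend an artificial target-language token, e.g. '<2es> Hello'."""
    return f"<2{tgt_lang}> {text}"


def build_training_pairs(corpus: Iterable[Example]) -> List[Tuple[str, str]]:
    """Mix all language pairs into one stream for a single shared model."""
    return [(tag_source(src, tgt_lang), tgt) for _, tgt_lang, src, tgt in corpus]


# Toy parallel data (assumed): Chinese-English and English-Spanish only,
# with no Chinese-Spanish examples.
corpus = [
    ("zh", "en", "你好", "Hello"),
    ("en", "es", "Hello", "Hola"),
]

training_pairs = build_training_pairs(corpus)

# At inference time, a zero-shot request pairs Chinese input with the Spanish
# token, a combination that never appears in the training pairs.
zero_shot_input = tag_source("你好", "es")
```

Because every language pair shares one vocabulary and one set of parameters, the only signal that distinguishes translation directions is the prepended token, which is what lets an unseen source-target combination be requested at all.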



[1] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.

[2] Barret Zoph, Deniz Yuret, Jonathan May, Kevin Knight. 2016. Transfer Learning for Low-Resource Neural Machine Translation.

[3] Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho. 2017. Unsupervised Neural Machine Translation.

[4] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato. 2017. Unsupervised Machine Translation Using Monolingual Corpora Only.