Topic Title: Emotion Recognition for Natural User Interaction


Technical Area: Speech Recognition, Computer Vision, Natural Language Processing, Sensor Fusion


Background: Natural User Interface

With improvements in the accuracy of speech recognition, natural language understanding, and computer vision, users have started to interact with machines through voice, gesture, and other modalities. These interfaces are referred to as natural user interfaces. In July 2017, the AI Labs of Alibaba Group released a smart speaker named “Tmall Genie X1”, and more than 2 million smart speakers have been sold since then. Given the popularity of smart speakers, we believe that natural user interfaces will become one of the primary portals to the Internet.


According to the 7-38-55 rule of personal communication, words account for only 7% of communication, whereas tone of voice contributes 38% (the remaining 55% is attributed to body language). Emotion recognition could improve natural user interfaces in two ways. First, emotion recognition tells the machine the user’s emotional state, which informs the decisions of the dialogue policy and improves the accuracy of recommendations. Second, the user’s recognized emotion is important feedback on how the user responded to previous interactions. This feedback can be used to improve the dialogue policy through reinforcement learning.
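As an illustration of the second point, the sketch below turns a recognized emotion into a scalar reward and uses it to update a running value estimate for the system action that preceded it. The reward values, the function name update_action_value, the action names, and the learning rate are all hypothetical; this is a simple bandit-style update under those assumptions, not a full reinforcement learning method for dialogue.

```python
# Hypothetical mapping from recognized emotion to reward; the values are
# illustrative assumptions, not taken from the proposal.
EMOTION_REWARD = {
    "happiness": 1.0,
    "surprise": 0.0,
    "sadness": -0.5,
    "fear": -0.5,
    "disgust": -1.0,
    "anger": -1.0,
}


def update_action_value(q_values, action, emotion, confidence=1.0, learning_rate=0.1):
    """Move the value estimate of `action` toward the emotion-derived reward.

    q_values maps a system action name to its current value estimate.
    confidence scales the reward by the recognizer's confidence (assumed in [0, 1]).
    """
    reward = confidence * EMOTION_REWARD[emotion]
    old = q_values.get(action, 0.0)
    q_values[action] = old + learning_rate * (reward - old)
    return q_values[action]


if __name__ == "__main__":
    q = {}
    update_action_value(q, "recommend_song", "happiness")
    update_action_value(q, "recommend_song", "anger")
    print(q)
```

Scaling the reward by the recognizer's confidence is one way to limit the damage from misrecognized emotions, which matters given that even human labelers disagree on a sizable share of samples.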


Many existing works focus on emotion recognition. However, it remains very challenging to develop robust and accurate emotion recognition with a limited number of sensors. Experiments show that even human labelers recognize the correct emotion on only 70%-80% of samples.



We aim to develop emotion recognition algorithms based on the user’s voice and language for our current products, such as smart speakers. The algorithms should cover all six basic emotions: anger, disgust, fear, happiness, sadness, and surprise.


The overall emotion recognition accuracy should exceed 65%, and the accuracy for negative emotions such as anger and sadness should exceed 80%.
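One way to check these two targets on a labeled test set is sketched below. It assumes that the accuracy for a single emotion means the fraction of samples with that true label that are predicted correctly (per-class recall), and that the negative emotions to check are anger and sadness; the text does not fix either definition, so both are assumptions, as are the function names.

```python
from collections import Counter

EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise")
# Assumption: only the two negative emotions named in the text are checked.
NEGATIVE = ("anger", "sadness")


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted emotion matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Per-emotion fraction of correctly predicted samples (per-class recall)."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {e: hits[e] / total[e] for e in EMOTIONS if total[e]}


def meets_targets(y_true, y_pred, overall_min=0.65, negative_min=0.80):
    """True if overall accuracy and each negative-emotion accuracy exceed the thresholds."""
    per_class = per_class_accuracy(y_true, y_pred)
    ok_overall = overall_accuracy(y_true, y_pred) > overall_min
    # A negative emotion with no test samples counts as not meeting the target.
    ok_negative = all(per_class.get(e, 0.0) > negative_min for e in NEGATIVE)
    return ok_overall and ok_negative


if __name__ == "__main__":
    labels = ["anger"] * 5 + ["sadness"] * 5 + ["happiness"] * 10
    preds = ["anger"] * 5 + ["sadness"] * 5 + ["happiness"] * 4 + ["surprise"] * 6
    print(overall_accuracy(labels, preds), meets_targets(labels, preds))
```

Reporting per-class figures alongside the overall number matters here because a class-imbalanced test set can reach the overall target while still missing most angry or sad utterances.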


As we are considering mounting a camera on our smart speaker, we also propose to develop a joint emotion recognition algorithm that uses the user’s voice, language, and image sequences.
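The proposal does not prescribe how the modalities should be combined. As one possible baseline, the sketch below uses decision-level (late) fusion: each modality produces its own probability distribution over the six emotions, and the distributions are averaged with per-modality weights. The modality names, the weights, and the function names fuse and predict are assumptions made for illustration; a modality that is unavailable (for example, no camera) is passed as None and skipped.

```python
EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise")


def fuse(modality_probs, weights):
    """Weighted average of per-modality emotion distributions (late fusion).

    modality_probs maps a modality name to a dict {emotion: probability},
    or to None when that modality is unavailable.
    weights maps a modality name to a non-negative weight; weights are
    renormalized over the modalities that are actually present.
    """
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality is required")
    total_weight = sum(weights[m] for m in available)
    return {
        e: sum(weights[m] * p.get(e, 0.0) for m, p in available.items()) / total_weight
        for e in EMOTIONS
    }


def predict(modality_probs, weights):
    """Return the emotion with the highest fused probability."""
    fused = fuse(modality_probs, weights)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    probs = {
        "voice": {"anger": 0.6, "sadness": 0.4},
        "language": {"anger": 0.2, "sadness": 0.8},
        "image": None,  # camera not mounted
    }
    weights = {"voice": 0.5, "language": 0.3, "image": 0.2}  # illustrative values
    print(predict(probs, weights))
```

Late fusion keeps the voice and language models usable on speakers without a camera; learned feature-level fusion is an alternative when all modalities are always present.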


Related Research Topics

1. Policy Learning for Dialogue System

The policy determines the dialogue system’s response according to the user’s input and status. Building a learning framework is therefore very important for improving the policy of dialogue systems.

2. Meaning Representation Language for Natural Language Understanding

A structured representation of the meaning of language serves as an interface between natural language and Internet services. Because natural language can express a huge number of intents and slots, we hope to develop an effective meaning representation language for this purpose.
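To make the idea of intents and slots concrete, the sketch below shows a minimal structured form for one utterance and a textual rendering that a service could consume. The class name Meaning, the intent name play_music, the slot names, and the rendering syntax are all hypothetical; the actual meaning representation language is the open research question, and this is only a flat intent-plus-slots example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Meaning:
    """A flat meaning representation: one intent plus named slot values."""
    intent: str
    slots: dict = field(default_factory=dict)


def render(meaning):
    """Render a Meaning as text, e.g. play_music(artist="Jay Chou").

    Slots are sorted by name so the output is deterministic.
    """
    args = ", ".join(f'{name}="{value}"' for name, value in sorted(meaning.slots.items()))
    return f"{meaning.intent}({args})"


if __name__ == "__main__":
    # Hypothetical parse of the utterance "Play a song by Jay Chou".
    parsed = Meaning(intent="play_music", slots={"artist": "Jay Chou"})
    print(render(parsed))
```

A flat structure like this cannot express nested or composed requests, which is one reason a richer representation language may be needed.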