Speaking Up: Optimizing Large Vocabulary Speech Recognition Systems
    2018-04-20    Academic Alibaba

This article is part of the Academic Alibaba series and is taken from the paper entitled “Deep-FSMN for Large Vocabulary Continuous Speech Recognition” by Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, accepted by IEEE ICASSP 2018. The full paper can be read here.

Deep neural networks have become the dominant acoustic models in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. They include both Feed-forward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs). Although RNNs have been shown to significantly outperform FNNs, training them usually relies on Back Propagation Through Time (BPTT) because of their internal recurrent cycles. This significantly increases the computational complexity of learning and can cause problems such as vanishing and exploding gradients.

Previously, the Alibaba tech team proposed a novel non-recurrent neural architecture, the Feed-forward Sequential Memory Network (FSMN), which can effectively model long-term dependencies in sequential data without using any recurrent feedback. To further improve this structure, researchers from Alibaba and the University of Science and Technology of China (USTC) now propose the Deep-FSMN (DFSMN), which introduces skip connections between memory blocks in adjacent layers. These skip connections alleviate gradient vanishing by enabling information flow across layers in deep structures.
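To make the idea of a memory block with a skip connection more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each memory block adds a weighted sum of past and future projection frames to the current frame, that the skip connection is a simple identity (the lower layer's memory output is added directly), and that frames beyond the utterance boundary count as zero. The function name `memory_block` and all of its parameters are hypothetical, and the filter weights, which a real model would learn, are passed in as plain lists.

```python
from typing import List

Vector = List[float]


def memory_block(
    proj: List[Vector],         # projection outputs of this layer, one vector per frame
    prev_memory: List[Vector],  # memory outputs of the layer below (skip-connection input)
    lookback: List[Vector],     # per-dimension weights for the current frame and past frames
    lookahead: List[Vector],    # per-dimension weights for future frames
    stride: int = 1,            # spacing between the frames that the filters look at
) -> List[Vector]:
    """Sketch of one DFSMN-style memory block (illustrative assumptions only)."""
    num_frames = len(proj)
    dim = len(proj[0])
    out: List[Vector] = []
    for t in range(num_frames):
        # Identity skip connection from the lower memory block plus the current projection.
        m = [prev_memory[t][d] + proj[t][d] for d in range(dim)]
        # Look-back filter: index 0 weights the current frame, index i the frame i*stride back.
        for i, a in enumerate(lookback):
            s = t - stride * i
            if s >= 0:
                for d in range(dim):
                    m[d] += a[d] * proj[s][d]
        # Look-ahead filter: index j (starting at 1) weights the frame j*stride ahead.
        for j, c in enumerate(lookahead, start=1):
            s = t + stride * j
            if s < num_frames:
                for d in range(dim):
                    m[d] += c[d] * proj[s][d]
        out.append(m)
    return out


if __name__ == "__main__":
    frames = [[1.0], [2.0], [3.0]]
    no_skip = [[0.0], [0.0], [0.0]]
    print(memory_block(frames, no_skip, lookback=[[0.5], [0.25]], lookahead=[[0.1]]))
```

Because the lower layer's memory output is added without any transformation, gradients can pass straight through the sum, which is the intuition behind the claim that skip connections ease the training of deep stacks. No recurrence is involved: every output frame depends only on a fixed window of input frames.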

Moreover, considering the demands of real-world applications, the research team proposes combining the DFSMN with Lower Frame Rate (LFR) technology to speed up decoding, and optimizing the DFSMN topology to meet latency requirements. An outline of this combined system is illustrated below.
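As a rough illustration of how a lower frame rate can speed up decoding, the sketch below stacks several consecutive feature frames into one "super frame" and then advances by more than one frame at a time, so the acoustic model is evaluated fewer times per utterance. The function name `lower_frame_rate`, the default `stack` and `skip` values of 3, and the choice to pad the final window by repeating the last frame are all assumptions made for this example. The article does not specify these details.

```python
from typing import List

Vector = List[float]


def lower_frame_rate(frames: List[Vector], stack: int = 3, skip: int = 3) -> List[Vector]:
    """Stack `stack` consecutive frames and keep one stacked frame every `skip` frames."""
    stacked: List[Vector] = []
    for start in range(0, len(frames), skip):
        window = frames[start:start + stack]
        # Pad the final window by repeating its last frame (an assumption of this sketch).
        while len(window) < stack:
            window.append(window[-1])
        # Concatenate the frames of the window into one longer feature vector.
        stacked.append([x for frame in window for x in frame])
    return stacked


if __name__ == "__main__":
    features = [[float(i)] for i in range(7)]  # 7 one-dimensional toy frames
    for super_frame in lower_frame_rate(features):
        print(super_frame)
```

With `skip=3`, the model sees roughly one third as many input frames, which is where the decoding speed-up in this kind of scheme comes from.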

 

 

When the effectiveness of the DFSMN was evaluated on multiple large speech recognition tasks in English and Mandarin, significant gains were noted in two areas. First, on the 2,000-hour English Fisher (FSH) task, the DFSMN, with its small model size, achieved a 1.5% absolute word error rate reduction compared to the widely used Bidirectional Long Short-Term Memory (BLSTM) network. Second, on the 20,000-hour Mandarin task, the LFR-trained DFSMN achieved a relative performance improvement of over 20% compared to the LFR-trained Latency Controlled BLSTM (LCBLSTM). Moreover, the lookahead filter order of the memory blocks in the DFSMN can be designed to match the latency demands of real-time applications.
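The link between lookahead filter order and latency can be sketched with some simple arithmetic. The example below assumes that the total delay in frames is the sum, over all memory blocks, of each block's lookahead order multiplied by its stride, and that each LFR frame spans several original frames. The helper names, the 10 ms frame shift, the LFR factor of 3, and the five-layer configuration are hypothetical values chosen for illustration. They are not taken from the paper.

```python
from typing import List, Tuple


def lookahead_delay_frames(layers: List[Tuple[int, int]]) -> int:
    """Total delay in frames, given (lookahead_order, stride) for each memory block."""
    return sum(order * stride for order, stride in layers)


def delay_ms(frames: int, frame_shift_ms: float = 10.0, lfr_factor: int = 3) -> float:
    """Convert a delay in LFR frames to milliseconds (assumed frame shift and LFR factor)."""
    return frames * frame_shift_ms * lfr_factor


if __name__ == "__main__":
    # Hypothetical topology: five memory blocks, each looking one frame ahead with stride 1.
    topology = [(1, 1)] * 5
    frames = lookahead_delay_frames(topology)
    print(frames, "frames of delay, about", delay_ms(frames), "ms")
```

Under these assumptions, reducing the lookahead order or stride of any memory block lowers the total delay directly, which is how the topology can be tuned toward a latency budget.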

Overall, it was found that the LFR-trained DFSMN with a 5-frame delay outperformed the LFR-trained LCBLSTM with a 40-frame delay. These results show that this new approach is well suited to real-world applications that require real-time speech recognition.

 
