Topic Title: Optimizing Distributed Execution of Deep Learning Systems
Large-scale Machine Learning
With the advent of huge amounts of data, readily available high-performance computing devices (GPGPUs), and progress in modeling techniques, deep learning has become pervasive across application scenarios. Inside Alibaba, deep learning has already been deployed successfully for e-commerce, machine translation, intelligent customer service, speech recognition, and computer vision tasks. However, because of model complexity and the sheer volume of training data, training a deep learning model is usually time-consuming. For example, a typical large-scale face recognition task with 200 million training samples takes more than three weeks to converge on a single NVIDIA P100 GPU, which has about 10 TFLOPS of peak computation power.
To keep deep learning model innovation iterating quickly, distributed training is necessary.
The training of a deep learning model is usually executed in mini-batch style. That is, at each training step a small subset of training samples (usually 16 to 1024) is selected; a forward pass is then run against the latest model state to compute the model's predictions and their difference from the labeled data (usually called the loss). After the forward pass, a second pass, called the backward pass, uses the computed loss to calculate the update to apply to the model weights so that the model performs better. One forward pass plus one backward pass makes up a single training step, or iteration. Training steps repeat until the model converges to a point that satisfies business requirements. Distributed execution can speed up this process. For the large-scale face recognition task mentioned above, for example, our carefully designed distributed execution strategy shortens training from three weeks to less than four days.
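The mini-batch loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the one-parameter-pair linear model, the mean squared error loss, the synthetic data, and every name and hyperparameter below (forward, backward, train, the learning rate, the step count) are hypothetical stand-ins for a real deep learning model and are not taken from any production system.

```python
import random


def forward(w, b, batch):
    """Forward pass: predictions and mean squared error loss for one batch."""
    preds = [w * x + b for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    return preds, loss


def backward(preds, batch):
    """Backward pass: gradients of the loss with respect to w and b."""
    n = len(batch)
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / n
    return grad_w, grad_b


def train(samples, batch_size=16, lr=0.05, steps=2000, seed=0):
    """Repeat forward + backward on random mini-batches for a fixed step count."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(steps):
        batch = rng.sample(samples, batch_size)   # pick a mini-batch
        preds, loss = forward(w, b, batch)        # forward pass
        grad_w, grad_b = backward(preds, batch)   # backward pass
        w -= lr * grad_w                          # apply the weight update
        b -= lr * grad_b
    return w, b, loss


# Synthetic, noise-free data generated from y = 2x + 1 (assumed for the demo).
samples = [(x / 100.0, 2.0 * (x / 100.0) + 1.0) for x in range(-100, 100)]
w, b, loss = train(samples)
```

In a data-parallel distributed setting, each worker would run the forward and backward passes on its own share of the mini-batch, and the gradients would be averaged across workers before the update is applied; that averaging is the communication step whose cost the challenges in this proposal are concerned with.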
Although some distributed strategies have succeeded in speeding up deep learning training, several challenges remain open:
1. Because deep learning training proceeds in mini-batches, scaling to a large number of devices is always a challenge. Without careful design, some computation power is always wasted waiting for communication tasks to complete.
2. Because deep learning models evolve quickly and the underlying computation devices are diverse, we need speed-up strategies that apply universally rather than to a single model or device.
3. For a given hardware platform (including both computation and communication devices), there is an upper bound on achievable performance. To push distributed performance further, model and system perspectives must be considered jointly in order to break through that bound. Designing such a mixed optimization strategy involves many trade-offs and a large design space.
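The first challenge can be made concrete with a rough cost model. The sketch below assumes synchronous data-parallel training in which gradients are averaged with a ring all-reduce, whose per-worker traffic is the well-known 2(N-1)/N times the gradient size. The function names, the overlap parameter, and the idea of a single effective bandwidth figure are illustrative assumptions, not measurements or a description of any particular system.

```python
def ring_allreduce_seconds(model_bytes, workers, bandwidth_bytes_per_s):
    """Estimated time for one ring all-reduce of the gradients.

    Each worker sends and receives about 2 * (N - 1) / N times the gradient
    size; latency terms are ignored in this simplified model.
    """
    if workers == 1:
        return 0.0
    return 2.0 * (workers - 1) / workers * model_bytes / bandwidth_bytes_per_s


def scaling_efficiency(compute_s, comm_s, overlap=0.0):
    """Fraction of each step spent on useful computation.

    overlap is the (assumed) fraction of communication hidden behind
    computation; the hidden part cannot exceed the computation time itself.
    """
    hidden = min(comm_s * overlap, compute_s)
    exposed = comm_s - hidden
    return compute_s / (compute_s + exposed)


# Hypothetical numbers: 400 MB of gradients, 16 workers, 10 GB/s effective
# bandwidth, 0.25 s of computation per step.
comm = ring_allreduce_seconds(400e6, 16, 10e9)
no_overlap = scaling_efficiency(0.25, comm, overlap=0.0)
full_overlap = scaling_efficiency(0.25, comm, overlap=1.0)
```

Under this model, whatever communication is not hidden behind computation shows up directly as idle device time, which is why the proposal stresses overlapping the two.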
Related Research Topics:
By working with the academic research community, we expect to make breakthroughs in the following areas:
1. System-level optimization: by designing better pipelined execution, distributed placement strategies (which parts of the computation should be placed on GPU, which on CPU, FPGA, etc.), and scheduling strategies, computation and communication can be overlapped as much as possible, reducing the idle time of hardware resources. Ideally, system scalability would extend to thousands of heterogeneous high-performance GPGPU devices.
2. Optimization algorithm design: by tailoring the training algorithm to better match the distributed execution scenario (for example, large-batch-oriented training and gradient compression), scalability can be improved further. In this research track, what we want to emphasize is the generality of the proposed optimization strategy. For a specific model, it is not very difficult to tune a training procedure for better scalability, but as an AI infrastructure provider, what we care about most is how general the training algorithm can be. Otherwise, much of the algorithm-tuning burden is pushed onto users, which is unacceptable from an infrastructure perspective.
3. Designing models that are friendly to distributed execution: some models are expected to be better suited to distributed execution than others; one example is LightRNN, proposed by Microsoft Research Asia. We are looking for a principled way to design such models.
4. Auto-parallel: a high-level, user-provided single-node model description is automatically transformed into a distributed execution plan that runs efficiently on heterogeneous clusters. Our hope is that with auto-parallel, modeling users no longer need to care about the characteristics of the underlying computation and communication hardware. They can focus on the model description and let the deep learning engine decide the details of the distributed execution plan, such as computation node placement, graph splitting, and communication strategy.
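As one illustration of the gradient compression mentioned under optimization algorithm design, the sketch below shows top-k sparsification with residual accumulation, a commonly used scheme in which each worker transmits only its k largest-magnitude gradient entries and carries the remainder over to the next step. It is an example of the general technique, not a method proposed in this document, and the names topk_compress and decompress are hypothetical.

```python
def topk_compress(grad, residual, k):
    """Keep the k largest-magnitude entries and carry the rest forward.

    Returns (sparse, new_residual): sparse is a list of (index, value) pairs
    to transmit, and new_residual holds the entries that were not sent so
    they can be added to the next step's gradient.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    order = sorted(range(len(corrected)),
                   key=lambda i: abs(corrected[i]), reverse=True)
    keep = sorted(order[:k])
    sparse = [(i, corrected[i]) for i in keep]
    new_residual = list(corrected)
    for i in keep:
        new_residual[i] = 0.0  # transmitted entries leave no residual
    return sparse, new_residual


def decompress(sparse, size):
    """Rebuild a dense gradient from the transmitted (index, value) pairs."""
    dense = [0.0] * size
    for i, v in sparse:
        dense[i] = v
    return dense


# Two consecutive steps on a toy 4-element gradient, sending 2 entries each.
residual = [0.0, 0.0, 0.0, 0.0]
sparse1, residual = topk_compress([0.1, -2.0, 0.3, 1.5], residual, k=2)
sparse2, residual = topk_compress([0.1, 0.0, 0.3, 0.0], residual, k=2)
```

The small entries skipped in the first step are not lost: they accumulate in the residual and are sent once they grow large enough. How well such a scheme preserves convergence across different models is exactly the generality question raised in item 2.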