Topic Title: Towards Effective Resource Utilization over Multi-tenant Large-scale Machine Learning Platform
Technical Area: AI Systems, Distributed Systems
Inside Alibaba, machine learning is widely used in various scenarios, such as e-commerce recommendation, customer service QA, image search, machine translation, and speech recognition. To streamline the iteration of machine-learning-driven business, the Alibaba computation platform team developed PAI (Platform of Artificial Intelligence), which consists of hundreds of thousands of CPU cores and thousands of GPU computation devices. In addition to CPUs and GPUs, FPGAs and domain-specific accelerators are also deployed on PAI, targeting special-purpose AI workloads.
Several thousand machine learning jobs run daily on PAI, and those jobs have diverse resource request characteristics. Some need a huge amount of memory but little computation power, while others are computation-bound with a small memory footprint. Some jobs require distributed execution because their demand for computation power surpasses what a single computation node can provide. Such jobs impose additional requirements on communication latency and bandwidth, in addition to memory and computation.
The same workload could be executed on a CPU, GPU, FPGA, or other computing device, with different latency and throughput behavior. Different workloads may also have different priorities based on their business nature. For example, it is not a big issue if an offline data analytics job is delayed by an hour or two. However, it is a disaster if a core business model refresh pipeline is delayed by 30 minutes.
Taking into consideration the scale of the computation cluster, the diversified job characteristics, the underlying heterogeneous computing hardware, and business priority constraints, optimizing hardware resource utilization at the cluster level is a challenging task for a large-scale machine learning platform.
There are some successful experiences in optimizing cluster resource utilization, such as pipelining data processing tasks to reduce the idle time of computation devices, and placing multiple tasks onto the same GPU device to squeeze the hardware utilization ratio. Nevertheless, cluster resource utilization remains an open challenge, especially when we consider the nature of machine learning workloads themselves. For example, distributed machine learning workloads usually involve frequent data transmission between computation nodes, and the degree of scale-out may affect model convergence trends. Distributed machine learning execution is therefore not a perfect “embarrassingly parallel” case, a characteristic usually leveraged for large-scale cluster resource utilization.
To optimize resource utilization of a large-scale machine learning platform, a holistic view is necessary, and the following questions must be answered:
1. Heterogeneous computing devices impose hard limitations. For example, NVIDIA GPUs have a hard rule about memory usage: when GPU memory is exhausted, the running GPU application simply stops. Although the NVIDIA Pascal architecture introduced a page migration engine that uses host memory as a larger backup pool for the scarcer GPU memory, application performance degrades significantly when page migration is triggered. Therefore, when multiple workloads are placed onto a single GPU device, a careful strategy must be designed to work around such limitations.
2. The relationship between distributed scale and machine learning workload execution behavior must be taken into consideration, since machine learning workloads usually do not follow the “embarrassingly parallel” paradigm.
3. The diversity of machine learning workloads makes resource utilization even more challenging. A holistic and general view needs to be established to accommodate the simultaneous execution of these diversified workloads.
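As a concrete illustration of the first question above, co-locating several workloads on one GPU must respect the hard memory limit. Below is a minimal sketch of memory-aware packing using first-fit-decreasing bin packing; the job names, memory estimates, and the 90% headroom factor (kept below the hard limit to avoid OOM or page-migration slowdown) are all illustrative assumptions, not PAI's actual policy.

```python
# Hypothetical sketch: first-fit-decreasing packing of jobs onto GPUs,
# keeping a safety margin below the hard GPU memory limit so that
# co-located jobs do not hit OOM or trigger UVM page migration.

def pack_jobs(jobs, gpu_capacity_mb, num_gpus, headroom=0.9):
    """jobs: list of (job_id, estimated_peak_mem_mb).
    Returns a dict mapping gpu_id -> [job_id]."""
    budget = gpu_capacity_mb * headroom       # usable budget per GPU
    placement = {g: [] for g in range(num_gpus)}
    used = [0.0] * num_gpus
    # Place the most memory-hungry jobs first (first-fit-decreasing).
    for job_id, mem in sorted(jobs, key=lambda j: -j[1]):
        for g in range(num_gpus):
            if used[g] + mem <= budget:
                placement[g].append(job_id)
                used[g] += mem
                break
        else:
            raise RuntimeError(f"no GPU can host {job_id} ({mem} MB)")
    return placement

# Illustrative jobs on two hypothetical 16 GB GPUs.
jobs = [("train_a", 9000), ("infer_b", 3000),
        ("infer_c", 2500), ("train_d", 8000)]
print(pack_jobs(jobs, gpu_capacity_mb=16000, num_gpus=2))
```

A production scheduler would of course need reliable peak-memory estimates per job, which is itself a hard problem for dynamic ML workloads.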
Related Research Topics:
By collaborating with the research and academic community, it is expected that progress can be made in the following areas:
1. Inter-workload optimization
There are two perspectives on inter-workload optimization. First, it is expected that multiple workloads could be merged or combined into a single logical task and scheduled onto a single computation device to exploit the underlying hardware's potential. Virtualization is one way to address this challenge, but not the only one.
Second, a typical machine learning workflow usually consists of multiple tasks, each with its own characteristics. A holistic, global view for optimizing the execution and resource usage of those tasks is necessary, since a bigger view exposes more optimization opportunities. A uniform language or intermediate representation (IR) may also be needed to describe the execution nature of multiple tasks in a common way; based on such an IR, aggressive optimization strategies can be applied. The Weld IR proposed by Stanford and MIT is a motivating example, but it is still at quite an early stage.
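To make the IR idea concrete, here is a toy sketch (not Weld's actual API) of how describing several pipeline stages in one common representation lets an optimizer fuse adjacent element-wise stages into a single pass, eliminating intermediate materialization between tasks. The stage encoding and functions are illustrative assumptions.

```python
# Toy cross-task fusion over a tiny "IR": a pipeline is a list of
# ("map", fn) or ("reduce", fn) stages; consecutive map stages are
# composed into one, so the data is traversed only once.

def fuse_maps(pipeline):
    """Compose consecutive map stages; leave other stages unchanged."""
    fused = []
    for kind, fn in pipeline:
        if kind == "map" and fused and fused[-1][0] == "map":
            prev = fused[-1][1]
            # New map applies the previous map first, then this one.
            fused[-1] = ("map", lambda x, f=fn, g=prev: f(g(x)))
        else:
            fused.append((kind, fn))
    return fused

# Two "tasks" expressed as maps (scaling, then squaring), plus a reduce.
pipeline = [("map", lambda x: x * 0.5),
            ("map", lambda x: x * x),
            ("reduce", lambda acc, x: acc + x)]
fused = fuse_maps(pipeline)
assert len(fused) == 2          # the two map stages collapsed into one
assert fused[0][1](4) == 4.0    # (4 * 0.5) ** 2
```

The point is only that a shared representation across task boundaries is what makes such cross-task rewrites visible to an optimizer in the first place.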
2. Priority-based machine learning task scheduling
As mentioned earlier, not all jobs are created equal, so priority scheduling is necessary to ensure that incoming higher-priority workloads are not starved for a long time when cluster resources are fully occupied by earlier-started, lower-priority tasks. Preemptive scheduling is not as easy as it sounds for machine learning tasks, since an evicted task's training state must be preserved so that it can resume rather than restart from scratch.
3. Compiler-oriented distributed optimization
We view the resource utilization problem of a large-scale machine learning platform as a typical compiler problem, since what needs to be solved is bridging the gap between the high-level description of machine learning workloads and high-performance execution on the underlying hardware. The strategy of mapping distributed execution requests onto various heterogeneous computing devices can be regarded as a typical graph optimization problem. More aggressively, automating the transformation of a single-machine learning task description into a distributed execution plan can also be viewed as a graph optimization problem. To ensure that high-level graph optimization is efficient enough, the low-level code generation problem must also be solved: once graph optimization takes effect, the boundaries between different components are broken, which exposes a bigger optimization space but also means the existing implementations of those components cannot be reused, so code generation techniques are needed to produce executable code for the bigger picture. Hopefully the code generation work can be achieved in a principled way rather than case by case.
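To illustrate the graph-optimization view on a deliberately simplified instance, the sketch below places a linear chain of operators onto heterogeneous devices by dynamic programming over per-op compute costs plus a cross-device transfer cost. The device names, cost numbers, and uniform transfer cost are all illustrative assumptions; real operator graphs are DAGs and real costs must be measured or modeled.

```python
# Hedged sketch: device placement for a chain of ops as a shortest-path
# (dynamic programming) problem over (op, device) states.

def place(op_costs, transfer_cost):
    """op_costs: list of {device: time} per op, in chain order.
    Returns (best_total_time, per-op device placement)."""
    devices = list(op_costs[0])
    # best[d] = (time, placement) for the prefix ending on device d.
    best = {d: (op_costs[0][d], [d]) for d in devices}
    for costs in op_costs[1:]:
        nxt = {}
        for d in devices:
            # Cheapest predecessor, counting a transfer on device change.
            p = min(devices,
                    key=lambda p: best[p][0] + (0 if p == d else transfer_cost))
            t = best[p][0] + (0 if p == d else transfer_cost) + costs[d]
            nxt[d] = (t, best[p][1] + [d])
        best = nxt
    end = min(devices, key=lambda d: best[d][0])
    return best[end]

# Three-op chain; the middle op happens to favor the CPU, but moving
# data back and forth may cost more than staying on the GPU.
ops = [{"cpu": 5, "gpu": 1}, {"cpu": 2, "gpu": 4}, {"cpu": 6, "gpu": 1}]
print(place(ops, transfer_cost=2))
```

Even this toy case shows the essential tension the proposal describes: the per-device optimum for each op in isolation is not the optimum for the graph once communication costs are part of the objective.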