Title: Intelligent scheduling and management in data centers
Technical Area: Resource Management
In modern data centers, mixed workloads run with varied resource constraints and Quality of Service (QoS) requirements. Deploying more applications together increases the resource utilization of the data center and thus reduces its cost. However, the challenge of this resource-sharing approach is resource interference between colocated workloads, which may degrade application performance. For important services such as trading platforms, search, and advertisement, such a performance drop would cause revenue loss in core business. A practical method to address this issue is to colocate best-effort (BE) tasks with critical workloads and to kill or throttle BE tasks whenever contention occurs.
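As a sketch of this kill-or-throttle policy, the snippet below reacts to a single contention signal (cycles per instruction); the metric name, thresholds, and task representation are illustrative assumptions, not Alibaba's production API.

```python
# Hypothetical kill-or-throttle policy for best-effort (BE) tasks.
# The CPI metric and thresholds are illustrative, not production values.

CPI_THRESHOLD = 1.5  # cycles-per-instruction above which we assume contention

def react_to_contention(host_metrics, be_tasks):
    """Return (action, affected_tasks) for one host based on a contention signal."""
    if host_metrics["cpi"] <= CPI_THRESHOLD:
        return ("none", [])
    # Mild contention: throttle BE tasks; severe contention: kill them.
    if host_metrics["cpi"] <= 2 * CPI_THRESHOLD:
        return ("throttle", be_tasks)
    return ("kill", be_tasks)

action, victims = react_to_contention({"cpi": 2.1}, ["be-job-1", "be-job-2"])
```

In practice the decision would draw on several signals (LLC misses, memory bandwidth, I/O wait) rather than CPI alone, but the control structure is the same.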
However, this approach still does not eliminate SLO violations of latency-sensitive applications, since even a small amount of interference on low-level resources such as caches (e.g. the LLC), memory, and I/O can cause a significant rise in latency. A promising way to address this is to profile all of our applications and tasks based on historical data from our monitors, gaining better knowledge of their resource utilization characteristics with respect to time, quantity, etc.
Based on this profiling, we could make better-informed placement decisions about which applications can be placed on the same machines and complement each other's resource demands, so that the resource utilization of our data center increases while the SLOs of applications remain well respected.
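The idea of placing applications with complementary demands can be sketched as follows; the 24-hour CPU profiles, the capacity value, and the host names are hypothetical illustrations, not production data.

```python
# Illustrative sketch: score candidate hosts by how well a new application's
# hourly CPU profile complements the load already placed there.
# Profiles are 24-element lists of average CPU utilization per hour of day.

def peak_after_placement(host_profile, app_profile):
    """Peak combined utilization if the app is placed on this host."""
    return max(h + a for h, a in zip(host_profile, app_profile))

def best_host(hosts, app_profile, capacity=1.0):
    """Pick the host whose combined peak stays lowest (and under capacity)."""
    feasible = {
        name: peak_after_placement(profile, app_profile)
        for name, profile in hosts.items()
        if peak_after_placement(profile, app_profile) <= capacity
    }
    return min(feasible, key=feasible.get) if feasible else None

# A night-heavy batch job complements a day-heavy web server.
web = [0.2] * 8 + [0.7] * 12 + [0.2] * 4    # busy 08:00-20:00
batch = [0.6] * 8 + [0.1] * 12 + [0.6] * 4  # busy at night
flat = [0.5] * 24
host_choice = best_host({"web-host": web, "flat-host": flat}, batch)
```

Here the batch job lands on the web host, whose daytime peak lines up with the batch job's idle hours, while the flat host would exceed capacity at night.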
Besides resource profiling, real-time resource interference diagnosis and performance analysis is another critical capability for a modern data center to achieve QoS-aware resource management. The goal of our resource management at Alibaba is to achieve data-driven, intelligent resource management that respects the QoS of all applications while optimizing resource utilization.
Another challenge we are facing is the ever-increasing amount of heterogeneous resources (e.g. GPUs, FPGAs) in our data center, which we need to manage more efficiently.
The overall target of this collaboration theme is to establish data-driven resource management based on our huge amount of historical data and real-time monitoring metrics. By integrating intelligence into our existing resource management framework, resource utilization can be increased while the QoS of all of our applications remains well respected. Moreover, we are also interested in the ability to manage heterogeneous resources in a more efficient way.
Related Research Topics
There are four topics in this collaboration project. The first three belong to the scope of data-driven resource management, and the last one is related to effectively managing a modern data center with emerging new technologies, i.e. NUMA and hardware such as GPUs, FPGAs, and other ASICs.
1. Application resource profiling and workload analysis
Running mixed workloads is an effective approach to increasing the resource utilization of modern data centers. However, resource interference between applications may cause significant performance deterioration, such as increased latency, which is unacceptable for many critical services. To make better placement decisions about which applications can be deployed together, fine-grained resource utilization profiling is required, so that we know which applications have complementary resource needs and which do not.
In this topic, we are looking for partners to profile the resource utilization characteristics of Alibaba's internal applications and to optimize our existing scheduling system based on these profiles, so as to make higher-quality placement decisions and raise the resource utilization of our production clusters without compromising the performance of running applications. To be specific:
- Profile resource utilization of Alibaba core applications and common applications.
- Based on the outcome of the profiling, improve the existing scheduling algorithms and resource management policies.
- Form a solid and reliable basis for our data-driven resource scheduling and management system.
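As a minimal illustration of the profiling step, the sketch below collapses raw per-hour CPU samples into a 24-entry utilization profile; the data shapes are assumptions for illustration, not our production monitoring format.

```python
# Minimal profiling sketch: turn raw (hour_of_day, cpu_fraction) samples
# into a 24-entry hourly utilization profile.

from collections import defaultdict
from statistics import mean

def hourly_profile(samples):
    """samples: iterable of (hour_of_day, cpu_fraction) tuples."""
    buckets = defaultdict(list)
    for hour, cpu in samples:
        buckets[hour].append(cpu)
    # Hours with no samples default to 0.0 so the profile always has 24 entries.
    return [round(mean(buckets[h]), 3) if buckets[h] else 0.0 for h in range(24)]

profile = hourly_profile([(9, 0.6), (9, 0.8), (21, 0.1)])
```

A real profile would cover multiple resource dimensions (CPU, memory, LLC, network, disk) and multiple days of history, but each dimension reduces to the same aggregation pattern.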
2. Runtime interference diagnosis and analysis
Application resource utilization profiles provide a static reference for making placement decisions and optimizing resource utilization, while online interference diagnosis and analysis are equally important in data-driven resource management. Observing a performance degradation of an application is simple via monitoring; the hard part is identifying the reason(s) behind it.
This collaboration topic invites partners or researchers to do the following with our production framework:
- Given a performance variation of a core application, quickly identify the reasons behind it, such as resource interference, host abnormality, misconfiguration, etc.
- In case of resource interference, identify the victims and propose an optimal mitigation, such as throttling, rescheduling, etc.
- Based on the interference analysis, improve the scheduling algorithm to avoid placements that introduce resource interference.
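One simple form of interference diagnosis is to rank co-located tasks by how strongly their resource usage correlates with the victim's latency over the same window. The sketch below does this with Pearson correlation; the task names and metric series are hypothetical.

```python
# Hypothetical diagnosis sketch: when a victim's latency rises, rank the
# co-located tasks by how strongly their CPU usage tracks the latency.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_suspects(victim_latency, neighbors):
    """neighbors: {task_name: cpu_series}; strongest correlation first."""
    scores = {name: pearson(victim_latency, cpu) for name, cpu in neighbors.items()}
    return sorted(scores, key=scores.get, reverse=True)

latency = [10, 11, 10, 30, 32, 31]  # ms; spikes in the second half
suspects = rank_suspects(latency, {
    "batch-a": [0.2, 0.2, 0.2, 0.9, 0.9, 0.9],  # spikes with the latency
    "web-b":   [0.5, 0.4, 0.5, 0.5, 0.4, 0.5],  # uncorrelated noise
})
```

Correlation alone cannot prove causation, so a production diagnoser would confirm suspects by cross-checking shared-resource counters (LLC misses, memory bandwidth) before throttling anyone.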
3. Intelligent resource management framework based on application QoS
The goal of a resource management system in a data center is to find an equilibrium between the QoS of applications and the cost of the resources they use. Based on application resource profiling and runtime interference diagnosis, the resource management system needs to manipulate resource allocation to ensure the SLOs of applications while optimizing the resource utilization of the data center to host more applications.
In this topic, we expect partners or researchers to build a QoS-aware resource management framework with us in our production resource management system.
To be specific:
- Based on the resource profiling and interference analysis, and with constant monitoring of the QoS of core applications, design an "observe-evaluate-act" closed loop that ensures the QoS and SLOs of core applications.
- Build the ability to predict QoS or SLO violations and take action accordingly before the performance of core applications is affected by various failures.
- (optional) We are interested in adopting AI in our resource management framework and exploring the possibility of letting AI learn scheduling policies itself, instead of using the current heuristics-based algorithms. This is an open field and we sincerely welcome researchers with good insight into it to collaborate with us.
- (optional) Serverless / Function-as-a-Service brings challenges to existing resource management methods. We also look forward to working with experts in this field to explore and address these challenges together.
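An "observe-evaluate-act" closed loop of the kind described above can be sketched as a tiny controller; the hooks, SLO threshold, and action names below are illustrative assumptions, not the production framework's API.

```python
# Bare-bones "observe-evaluate-act" controller sketch. The observe/act hooks
# and the action names are hypothetical stand-ins for production components.

def evaluate(latency_ms, slo_ms):
    """Map an observation to an action for the controller to execute."""
    if latency_ms <= slo_ms:
        return "noop"
    if latency_ms <= 1.5 * slo_ms:
        return "throttle_be"  # mild violation: squeeze best-effort tasks
    return "reschedule"       # severe violation: move the victim

def control_step(observe, act, slo_ms=100):
    """One iteration of the loop: observe a metric, evaluate, then act."""
    action = evaluate(observe(), slo_ms)
    if action != "noop":
        act(action)
    return action

taken = control_step(observe=lambda: 130, act=lambda a: None)
```

The prediction bullet above would slot in as a second evaluator that runs on forecasted metrics instead of observed ones, triggering the same actions earlier.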
4. NUMA-aware and/or heterogeneous-resource-aware scheduling and resource management framework
NUMA is a hardware feature that can improve the performance of a variety of applications if used correctly.
The key to adopting NUMA in production is to understand which applications can run with NUMA activated and how to find the optimal configuration. Ideally, this should be transparent to applications, and the resource management system should hide all the complexity.
Another direction in this topic stems from the fact that heterogeneous resources such as GPUs and FPGAs have been widely used in modern data centers as the demand for machine learning and AI increases in almost every business field. The problem with these heterogeneous resources is that they cannot be shared between multiple users due to a lack of isolation and virtualization support.
The goal of this collaboration topic is as follows:
- Design NUMA aware scheduling mechanism based on our existing scheduling & resource management system.
- Add NUMA-related statistics to application resource profiling and optimize the NUMA-aware scheduling framework based on historical data.
- Explore with us how to efficiently manage heterogeneous resources in our data center to enable sharing, multi-tenancy, etc.
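As a toy illustration of NUMA-aware placement, the sketch below pins a task to the node whose local free memory can hold its working set, falling back to the least-loaded node; the node representation and fields are assumptions, not a real scheduler API.

```python
# Toy NUMA-aware placement: prefer a node whose local free memory fits the
# task's working set, so memory accesses stay node-local; otherwise fall
# back to the least-loaded node. Node shapes are hypothetical.

def pick_numa_node(nodes, task_mem_gb):
    """nodes: {node_id: {'free_mem_gb': float, 'load': float}}."""
    fits = {n: v for n, v in nodes.items() if v["free_mem_gb"] >= task_mem_gb}
    candidates = fits or nodes  # no node fits locally: accept remote accesses
    return min(candidates, key=lambda n: candidates[n]["load"])

nodes = {
    0: {"free_mem_gb": 4.0, "load": 0.2},
    1: {"free_mem_gb": 16.0, "load": 0.6},
}
chosen = pick_numa_node(nodes, task_mem_gb=8.0)
```

A production version would also account for CPU topology, device locality (e.g. which node a GPU hangs off), and the profiled statistics from the bullet above.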