Title: Intelligent scheduling and management in data centers


Technical Area: Resource Management


Background
In modern data centers, mixed workloads run with varied resource constraints and Quality of Service (QoS) requirements. As more applications are deployed together, the resource utilization of the data center increases, which reduces its cost. However, the challenge of this resource sharing approach is resource interference between colocated workloads, which may degrade application performance. For important tasks such as the trade platform, search, and advertisement, such a performance drop would cause loss of revenue in the core business. A practical method to address this issue is to colocate best effort (BE) tasks with critical workloads and to kill or throttle BE tasks whenever contention occurs.


However, this approach still does not prevent service level objective (SLO) violations in latency-sensitive applications, because even a small amount of interference in micro resources such as caches (including the last-level cache, LLC), memory, and IO could cause a significant rise in latency. A promising way to solve this issue is to profile all of our applications and tasks based on historical data from our monitors, in order to gain better knowledge of their resource utilization characteristics with respect to time, quantity, etc.

 

Based on the profiling, we could make optimistic placement decisions about which applications can be placed on the same machines because their resource demands complement each other, so that the resource utilization of our data center increases while the SLOs of applications are still respected.
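As an illustration of how profiles could drive such a placement decision, the following minimal sketch ranks pairs of applications by the peak of their combined utilization over aligned time slots. The application names, the profile values (percent of one machine's capacity), and the function names are all hypothetical, and a production scheduler would consider many more resource dimensions than a single utilization series.

```python
from itertools import combinations

def combined_peak(profile_a, profile_b):
    """Peak of the summed utilization across aligned time slots."""
    return max(a + b for a, b in zip(profile_a, profile_b))

def best_colocation_pairs(profiles, capacity=100):
    """Rank application pairs whose combined peak fits within capacity.

    profiles: dict mapping app name -> utilization (percent) per time slot.
    Returns a list of (combined_peak, app_a, app_b), lowest peak first.
    """
    candidates = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        peak = combined_peak(prof_a, prof_b)
        if peak <= capacity:
            candidates.append((peak, name_a, name_b))
    return sorted(candidates)

# Hypothetical profiles: "search" peaks early, "batch" peaks late.
profiles = {
    "search": [70, 60, 20, 10],
    "batch":  [10, 20, 60, 70],
    "ads":    [60, 60, 30, 20],
}
pairs = best_colocation_pairs(profiles)
for peak, a, b in pairs:
    print(f"{a} + {b}: combined peak {peak}%")
```

Pairs whose peaks coincide, such as the hypothetical "search" and "ads" here, exceed the capacity and are filtered out, while pairs with complementary demand remain as colocation candidates.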


Besides resource profiling, real-time resource interference diagnosis and performance analysis are another critical capability for a modern data center to achieve QoS-aware resource management. The goal of resource management at Alibaba is data driven, intelligent resource management that respects the QoS of all applications while optimizing resource utilization.


Another challenge we are facing is the ever-increasing amount of heterogeneous resources (e.g. GPU, FPGA) in our data center, and we need to manage them more efficiently.


Target
The overall target of this collaboration theme is to establish data driven resource management based on our huge amount of historical data and real-time monitoring metrics. By integrating intelligence into our existing resource management framework, resource utilization can be increased while the QoS of all of our applications is still respected. Moreover, we are also interested in the ability to manage heterogeneous resources more efficiently.


Related Research Topics

There are four topics in this collaboration project. The first three belong to the scope of data driven resource management, and the last one relates to effectively managing modern data centers with emerging technologies, i.e. NUMA and hardware such as GPUs, FPGAs, and other ASICs.
1. Application resource profiling and workload analysis
Running mixed workloads is an effective approach to increase the resource utilization of a modern data center. However, resource interference between applications may cause significant performance deterioration, such as increased latency, and this is not acceptable for many critical services. To make more optimistic placement decisions about which applications could be deployed together, fine-grained resource utilization profiling is required, so that we have better knowledge about which applications have compensating resource needs and which do not.


In this topic, we are looking for partners to profile the resource utilization characteristics of Alibaba’s internal applications, to optimize our existing scheduling system based on these profiles for higher-quality placement decisions, and to raise the resource utilization of our production clusters without compromising the performance of running applications. To be specific:

 

2. Runtime interference diagnosis and analysis
Application resource utilization profiles provide a static reference for making placement decisions and optimizing resource utilization. Online interference diagnosis and analysis are equally important in data driven resource management. Observing a performance degradation of an application via monitoring is simple; the hard part is identifying the reason(s) behind it.
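One simple starting point for such a diagnosis is to rank monitored resource metrics by how strongly they move together with the observed latency. The sketch below uses the Pearson correlation coefficient for this; the metric names and sample values are hypothetical, and a high correlation only marks a metric as a suspect worth investigating, not as a proven cause.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_suspects(latency, metrics):
    """Sort metrics by absolute correlation with latency, strongest first."""
    scores = {name: pearson(latency, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical samples taken at the same six points in time.
latency_ms = [10, 11, 10, 25, 30, 12]
metrics = {
    "llc_miss_rate":   [2, 2, 2, 8, 9, 3],
    "disk_io_wait":    [5, 4, 6, 5, 4, 6],
    "net_retransmits": [1, 1, 1, 1, 1, 1],
}
suspects = rank_suspects(latency_ms, metrics)
for name, score in suspects:
    print(f"{name}: r = {score:.2f}")
```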


This collaboration topic is for partners or researchers to do the following with our production framework:


3. Intelligent resource management framework based on application QoS
The goal of a resource management system in a data center is to find an equilibrium between the QoS of applications and the cost of the resources that the applications use. Based on application resource profiling and runtime interference diagnosis, the resource management system needs to adjust resource allocation to ensure the SLOs of applications while optimizing the resource utilization of the data center so that it can host more applications.
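To make the idea of adjusting allocation against an SLO concrete, here is a minimal sketch of one control step that shrinks or grows the share of a machine granted to best effort tasks based on the observed tail latency of a critical application. The function name, the halving and step sizes, and the 80% headroom threshold are illustrative assumptions, not a description of any existing system.

```python
def adjust_be_quota(current_quota, observed_p99_ms, slo_p99_ms,
                    step=0.1, headroom=0.8, min_quota=0.0, max_quota=1.0):
    """Return the next best-effort quota (fraction of the machine).

    - SLO violated: halve the quota so critical work recovers quickly.
    - Latency comfortably below the SLO: grow the quota by a small step.
    - Otherwise: hold the quota steady.
    """
    if observed_p99_ms > slo_p99_ms:
        new_quota = current_quota * 0.5
    elif observed_p99_ms < headroom * slo_p99_ms:
        new_quota = current_quota + step
    else:
        new_quota = current_quota
    # Clamp to the allowed range.
    return max(min_quota, min(max_quota, new_quota))

# Hypothetical sequence of p99 latency observations against a 100 ms SLO.
quota = 0.4
for p99 in [50, 60, 120, 90]:
    quota = adjust_be_quota(quota, p99, slo_p99_ms=100)
    print(f"p99={p99} ms -> BE quota {quota:.2f}")
```

Cutting the quota multiplicatively while growing it additively is a deliberately conservative choice in this sketch: it favors protecting the SLO over reclaiming utilization quickly.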


In this topic, we expect partners or researchers to build QoS-aware resource management with us in our production resource management system.

 

To be specific:


4. NUMA and/or heterogeneous resources aware scheduling and resource management framework
NUMA (non-uniform memory access) is a hardware feature that can improve the performance of a variety of applications if used correctly.

 

The key to adopting NUMA in production is to understand which applications are able to run with NUMA activated and how to find the optimal configuration. Ideally, this should be transparent to applications, and the resource management system should hide all the complexity.
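As one small piece of what hiding this complexity could look like, the sketch below picks a single NUMA node that can hold a task's CPU and memory request, so that the task avoids remote memory access. The node inventory, the best-fit rule, and the function name are assumptions for illustration; finding the optimal configuration for a real application would require measurement.

```python
def pick_numa_node(nodes, cpus_needed, mem_gb_needed):
    """Choose the NUMA node that fits the request with the fewest CPUs left over.

    nodes: dict mapping node id -> (free_cpus, free_mem_gb).
    Returns the chosen node id, or None if no single node fits,
    in which case the caller must decide whether to span nodes.
    """
    fitting = [
        (free_cpus - cpus_needed, node_id)
        for node_id, (free_cpus, free_mem_gb) in nodes.items()
        if free_cpus >= cpus_needed and free_mem_gb >= mem_gb_needed
    ]
    if not fitting:
        return None
    return min(fitting)[1]

# Hypothetical free capacity per NUMA node: (CPUs, memory in GB).
nodes = {0: (4, 32), 1: (12, 64), 2: (8, 16)}
print(pick_numa_node(nodes, cpus_needed=6, mem_gb_needed=24))
print(pick_numa_node(nodes, cpus_needed=2, mem_gb_needed=8))
print(pick_numa_node(nodes, cpus_needed=16, mem_gb_needed=8))
```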

 

Another direction in this topic is based on the fact that heterogeneous resources such as GPUs, FPGAs, etc. have been used widely in modern data centers as the demand for machine learning and AI increases in almost every business field. The problem with these heterogeneous resources is that they cannot be shared between multiple users due to a lack of isolation and virtualization support.


The goal of this collaboration topic is as follows: