Computing resources are essential to both online services and offline batch jobs, yet a surprising proportion of those resources sits idle at any given time. According to Gartner and McKinsey, servers worldwide use only 6 to 12 percent of their CPU capacity. For Alibaba's online services, average daily utilization was only about 10 percent until the tech team introduced its colocation solution to address the issue.
Through colocation technology, Alibaba can effectively put those previously underutilized resources to work. This is achieved by exploiting the different properties of online services and batch jobs. By employing colocation on its servers, Alibaba has increased CPU utilization from 10 percent to as much as 40 percent – an impressive 30-percentage-point increase.
Colocation boosts cluster resource usage by 30 percent
Alibaba first started researching colocation technology for large-scale use in e-commerce in 2014. The first large-scale use came in 2017, when 20 percent of traffic during the Double 11 Global Shopping Festival was handled using the technology.
Alibaba's colocation technology: from research to implementation
Better CPU Utilization Through Colocation
Colocation works by scheduling different task types onto the same physical resources. This means combining clusters and using control methods such as scheduling and resource isolation to make full use of resource capacity. Alibaba's colocation technology does this while safeguarding service-level objectives (SLOs).
Colocation technology and CPU resources
To be eligible for colocation, task types must have complementary resource-demand patterns and clearly different priority levels.
Colocation can be applied effectively to online services and offline batch jobs by exploiting the different computing requirements of each.
Online services require large amounts of resources at peak times – such as during Alibaba's Double 11 Global Shopping Festival – but far fewer outside these periods, leaving large amounts of resources unused. Usage is higher during the day than in the early hours, and online services must run uninterrupted.
Batch jobs take up a large amount of resources, especially CPU time. The peak time for batch jobs is early in the morning.
Online services and batch jobs therefore meet the eligibility criteria for colocation: their resource demands peak at different times of day, and online services take clear priority over batch jobs.
A useful way to understand how colocation better utilizes CPU resources for online services and batch jobs is by using the metaphor of sand and stones.
Online service resource demands can be likened to stones held within a container. Outside of peak times, an online service only needs to use the stone resources within the container and does not need to use the gaps between the stones.
Batch job demands can be likened to sand that fills the gaps between the stones in the container. As online services do not usually need this space, the batch-job sand can fill it. But when demand for online services rises, the batch-job sand leaves the container so that online services can use the extra resources. In this way, the higher-priority task type – in this case, online services – continues uninterrupted.
When demand for online services is high, online services use more resources. The green "stones" represent online services while the blue "sand" represents batch jobs.
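The stones-and-sand behavior can be sketched as a toy allocator in which online demand is satisfied first and batch jobs elastically absorb whatever is left. Everything here – the function name, the capacity, the demand figures – is illustrative, not Alibaba's implementation:

```python
def allocate(capacity, online_demand):
    """Give online services (the 'stones') what they ask for, up to
    capacity; batch jobs (the 'sand') fill the remaining gaps."""
    online = min(online_demand, capacity)
    batch = capacity - online  # batch yields as online demand grows
    return online, batch

# Off-peak: online services use little, so batch jobs fill the gaps.
print(allocate(100, 10))   # (10, 90)
# Peak (e.g. Double 11): the batch-job sand withdraws almost entirely.
print(allocate(100, 95))   # (95, 5)
```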
Calculating the Benefits of Colocation
The amount of resources that colocation saves is calculated according to a simple formula.
Suppose that a data center has "N" servers, and the utilization rate ("R") increases from R1 to R2. Without taking any other restrictions into account, the number of servers ("X") that can be saved is calculated as follows:

X = N × (1 − R1 / R2)
It follows that if you have 100,000 servers and the utilization rate increases from 28 percent to 40 percent, you can save 30,000 servers. In other words, colocation can save up to 30 percent of resources.
This is corroborated in a 2015 paper published by Google entitled "Large-scale cluster management at Google with Borg". In this paper, Google describes a solution using similar technology to the colocation solution developed by Alibaba, and notes that they were able to save 20 to 30 percent of their computing resources using this solution.
The following figure shows the architecture of Alibaba's colocation scheduling solution.
Colocation scheduling architecture
The solution has three key components: two scheduling clusters, which operate autonomously, and the Level0 mechanism, which coordinates between them:
Sigma manages scheduling for online service containers.
Fuxi manages scheduling for batch jobs.
The Level0 mechanism, shown in the diagram below, coordinates between the two schedulers and manages the cluster for smooth, stable operation.
Colocation Resource Isolation
For colocation to work, resource isolation must be implemented properly; otherwise, competition for resources cannot be resolved effectively. This can lead to problems for online services: at best, a degraded user experience; at worst, failure to provide the service at all.
The following two approaches can be taken to avoid issues with resource competition:
Scheduling uses resource representation technology to predict and plan resource demands before competition arises, reducing the probability that resource competition occurs in the first place. Scheduling can be continually optimized.
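One very simple form of such prediction – a stand-in for Alibaba's actual resource-representation models, with made-up class and parameter names – is a moving average over recent usage samples plus a safety margin, used to plan headroom before competition arises:

```python
from collections import deque

class UsagePredictor:
    """Toy resource profile: predict near-future CPU demand as the
    moving average of the last `window` samples times a safety
    margin. Illustrative only; the real system is far richer."""

    def __init__(self, window=5, margin=1.2):
        self.samples = deque(maxlen=window)
        self.margin = margin

    def observe(self, cpu_usage):
        self.samples.append(cpu_usage)

    def predict(self):
        if not self.samples:
            return 0.0
        return (sum(self.samples) / len(self.samples)) * self.margin

p = UsagePredictor(window=3)
for u in (0.2, 0.3, 0.4):
    p.observe(u)
print(round(p.predict(), 2))  # 0.3 average * 1.2 margin = 0.36
```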
Scheduling can be optimized according to the following methods and considerations:
For online services and batch jobs, the peaks and troughs of resource demands occur at different times of the day. Time-division multiplexing can therefore be used on a daily basis to improve resource utilization.
Peaks and troughs for online services and offline batch jobs
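A minimal sketch of daily time-division multiplexing might map the hour of day to a batch-job CPU quota. The hours and fractions below are illustrative assumptions, not Alibaba's actual schedule:

```python
def batch_cpu_share(hour):
    """Fraction of cluster CPU offered to batch jobs at a given hour.

    Assumes online traffic peaks during the day and evening and
    troughs in the early morning, when batch jobs peak (all the
    numbers here are made up for illustration).
    """
    if 0 <= hour < 6:      # early morning: online trough, batch peak
        return 0.7
    if 9 <= hour < 23:     # daytime/evening: online peak
        return 0.2
    return 0.4             # shoulder hours

print(batch_cpu_share(3))   # 0.7
print(batch_cpu_share(14))  # 0.2
```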
Promotion Time-Division Multiplexing
The e-commerce business runs huge online promotions, such as the Double 11 Global Shopping Festival. During large-scale promotions or stress testing, pressure on resources is several times higher than during normal operations. If, at that time, we reduce the resources given to computing tasks and hand them all to online services, we can deal with that short-term burst of demand quite effectively.
Allocation of colocation resources according to demand
When downgrading batch jobs, the impact should be reduced as much as possible; for lossless downgrading, careful attention must be paid to the downgrade plan. Because online services do not usually use many resources, freeing up 70 percent of computing-task resources takes less than three minutes. This means that in the first five minutes after a downgrade, the impact of batch jobs on online services should not be too large.
Another issue is minute-level recovery. If recovery can be completed within minutes, batch jobs are affected only at the real peak points, and because the peak period is very short, the overall effect is minimized.
Earlier we used the metaphor of sand to describe batch jobs, but even grains of sand vary in size. To make sure the gaps in resources are filled without overflowing, we need to filter and select the sand – that is, screen computing tasks for suitability.
Computing task selection process
Because memory provisioning previously did not account for colocation, memory and CPU were matched to the original demands of online services, leaving no surplus memory. As demand increased, memory became a bottleneck, and static distribution of memory was no longer sufficient. To address this, dynamic distribution of memory was introduced.
Computer storage separation
Different storage solutions are used for online services and batch jobs:
When colocation mixes batch jobs onto machines whose local drives are still being used to process data, scheduling becomes significantly more complicated. To resolve this, we pool the local drives into a virtual storage layer, so that different storage devices can be accessed on demand through remote access.
Alibaba has also started building out 25G network capacity on a large scale. This will improve overall network capability and allow remote access to become as fast as local access.
Computing and storage scheduling
If resource competition does occur, task prioritization decides how to protect high-priority tasks while minimizing the impact on lower-priority ones. This is a last-resort measure for extreme circumstances.
Kernel isolation can be implemented according to the following methods and considerations:
This is the most important kernel-isolation item: when CPU pressure from online services rises, batch jobs must yield within milliseconds.
As the red arrows show, when online services need more resources, batch jobs must immediately use fewer resources
CPU scheduling optimization includes:
This means assigning priority levels per cgroup. High-priority tasks can preempt time slices from low-priority tasks.
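On Linux, this kind of per-cgroup priority can be expressed through the cgroup-v1 `cpu.shares` knob (`cpu.weight` in cgroup v2). The sketch below only builds the file path and value an agent would write; the group names and share values are our assumptions, not Alibaba's configuration:

```python
# Illustrative cpu.shares values: online services get a very large
# weight so they win CPU under contention; batch jobs get the kernel
# minimum of 2 so they yield almost all contended CPU time.
CPU_SHARES = {
    "online": 10240,
    "batch": 2,
}

def shares_config(group, root="/sys/fs/cgroup/cpu"):
    """Return the (path, value) pair an agent would write for `group`.

    We deliberately do not write the file here, since that requires
    root and a mounted cgroup hierarchy.
    """
    return f"{root}/{group}/cpu.shares", CPU_SHARES[group]

path, value = shares_config("batch")
print(path, value)  # /sys/fs/cgroup/cpu/batch/cpu.shares 2
```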
This means avoiding scheduling offline tasks onto hyper-threads (HTs) adjacent to online tasks, and migrating any offline tasks scheduled onto such HTs away after they are woken.
Cache Allocation Technology (CAT), a feature of Broadwell (BDW) CPUs, is used to control cache allocation. In this way, the cache use of low-priority tasks can be limited.
Memory-bandwidth isolation involves policy adjustments driven by real-time monitoring, and CFS bandwidth control that adjusts the run-time slice length for batch jobs: giving lower-priority tasks shorter time slices allows higher-priority batch jobs to use more CPU resources.
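The CFS bandwidth control mentioned above maps naturally onto the kernel's `cpu.cfs_period_us` / `cpu.cfs_quota_us` interface. This sketch just computes a quota value from a target CPU fraction; the fractions in the examples are illustrative:

```python
def cfs_quota_us(cpu_fraction, period_us=100_000):
    """CFS quota (microseconds per period) granting `cpu_fraction`
    of one CPU.

    Writing this value to cpu.cfs_quota_us (with period_us in
    cpu.cfs_period_us) caps a cgroup's run time each period;
    shrinking the quota for batch jobs frees CPU for
    higher-priority work.
    """
    return int(cpu_fraction * period_us)

print(cfs_quota_us(0.3))          # 30000 us per default 100 ms period
print(cfs_quota_us(0.1, 50_000))  # 5000 us per 50 ms period
```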
For file-level IO-bandwidth isolation, a blkio controller with IOPS and bytes-per-second (BPS) constraints has been added as the upper level. File-level low-bandwidth thresholds (the lower level) allow tasks to use surplus free bandwidth once the low-bandwidth threshold has been crossed. A metadata throttle limits specific metadata operations, such as deleting large numbers of small files at once.
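The upper-level IOPS/BPS limits correspond to the Linux blkio throttle files, which accept `major:minor value` lines. This helper only formats such a line; the device numbers and limits in the example are illustrative:

```python
def blkio_throttle_line(major, minor, limit):
    """Format a '<major>:<minor> <limit>' line as accepted by e.g.
    blkio.throttle.read_bps_device (limit in bytes per second) or
    blkio.throttle.read_iops_device (limit in IOs per second)."""
    return f"{major}:{minor} {limit}"

# Cap reads on device 8:0 (typically /dev/sda) at 10 MiB/s.
print(blkio_throttle_line(8, 0, 10 * 1024 * 1024))  # "8:0 10485760"
```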
Network traffic control includes bandwidth isolation – both local bandwidth isolation and bandwidth isolation between Pouch containers. It also includes bandwidth sharing, divided into gold, silver, and bronze levels: offline tasks can use shared bandwidth, which is taken up according to priority.
After four years of development, colocation handled 20 percent of the traffic for Alibaba's Double 11 Global Shopping Festival in 2017. Over the coming year, colocation will continue to evolve.
Colocation will be applied to more diverse workloads, such as real-time computing, GPU, and FPGA, and will also grow in scale. For resource representation, deep learning will be used to improve prediction accuracy and utilization rates.
Optimization of prioritization systems will further improve scheduling capability. Scheduling will then be performed by priority level rather than by the online-versus-batch distinction. In the future, colocation can become a universal capability of scheduling systems.
First hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook