The Alibaba Tech Team’s “nuclear weapon” in testing technology
Spanning numerous business sectors and supported by a multitude of service systems, the scale of Alibaba’s operations poses a big challenge for the technical team when conducting capacity planning. This challenge is biggest during promotional events such as the Double 11 Global Shopping Festival, when Alibaba’s systems experience sudden and huge spikes in traffic. Maintaining site availability in such situations requires careful planning and perseverance in problem-solving.
Capacity planning for Double 11 aims to answer two key questions: What is the projected traffic volume during the event, and how many machines will be needed to support that volume? Answering the first question is a straightforward case of using prediction algorithms and looking at historical data.
In principle, the second question is also simple to answer – a case of determining the per-system machine capacity and then dividing the projected traffic volume by this figure to calculate the minimum number of machines required. Adding further machines to this figure allows for a capacity buffer, based on the assumption that the figures used for the calculation are ultimately best guesses.
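The naive calculation described above can be sketched in a few lines. This is an illustrative example only; the traffic figures, per-machine capacity, and buffer ratio are made-up numbers, not Alibaba's actual values:

```python
import math

def machines_needed(projected_qps, per_machine_qps, buffer_ratio=0.2):
    """Naive sizing: divide projected traffic by single-machine capacity,
    then add a safety buffer on top of the minimum."""
    minimum = math.ceil(projected_qps / per_machine_qps)
    return math.ceil(minimum * (1 + buffer_ratio))

# Illustrative figures: 140,000 QPS projected, 800 QPS per machine, 20% buffer
print(machines_needed(140_000, 800))  # 210
```

As the next paragraph explains, this arithmetic treats each machine in isolation, which is exactly where it breaks down.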
In practice, the answer is more problematic. Years of experience with Double 11 events have shown Alibaba that this approach falls short of accurately predicting the number of machines required. Since the calculation assumes that each machine functions in isolation, it neglects to consider how the system, as a whole, responds to increased traffic during Double 11.
To address this problem, the Alibaba tech team introduced “full-scale stress testing” as an additional stage in their capacity planning process. This key step simulates the same business scenario and traffic volume as Double 11 across the whole platform, painting the team a more realistic picture of capacity requirements. Since traffic can still fluctuate unexpectedly during Double 11, the tech team also developed traffic control mechanisms to mitigate the problems that arise at peak capacity.
In summary, having honed its Double 11 capacity planning down to the minutest of details, Alibaba developed a four-stage capacity planning process in which the full-scale stress test is key to ensuring smooth operations even during high-traffic events. After obtaining the projected traffic volume for the event, the remaining three stages are per-system machine stress testing, full-scale stress testing, and flow capacity control.
Per-system machine stress testing can be achieved in four ways: request simulation, request replication, request forwarding, and load balancing adjustment. Each of these methods fits the needs of specific scenarios, but each also comes with certain drawbacks.
Relatively easy to set up, request simulations can be produced with open-source or commercial tools such as Apache Bench, Webbench, http_load, Apache JMeter, and LoadRunner, and are best performed on unlaunched or low-traffic systems. This is because the discrepancies between simulated and real requests skew the stress test results, and the simulated requests risk polluting backend data stores.
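In the spirit of tools like Apache Bench or JMeter, a request simulator fires concurrent synthetic requests and reports throughput and latency. A minimal, self-contained sketch (the `handler` callable stands in for a real HTTP call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulate_load(handler, total_requests=1000, concurrency=50):
    """Fire simulated requests at `handler` concurrently and report
    basic stats, as a load-testing tool would."""
    latencies = []
    def one_request(i):
        start = time.perf_counter()
        handler(i)                      # stand-in for an HTTP request
        latencies.append(time.perf_counter() - start)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, range(total_requests)))
    elapsed = time.perf_counter() - start
    return {"qps": total_requests / elapsed,
            "avg_latency": sum(latencies) / len(latencies)}

stats = simulate_load(lambda i: time.sleep(0.001))
print(f"{stats['qps']:.0f} requests/s")
```

Because the synthetic requests never match real traffic exactly, throughput measured this way is only an approximation of production capacity, which is the drawback noted above.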
Request replication samples traffic from the actual production environment, but it still carries a risk of data pollution and requires the copied requests to be intercepted and replayed through a specially earmarked machine, making it practical only for systems handling lower request volumes.
On the other hand, request forwarding diverts queries from across a distributed system onto a single machine, increasing its traffic without any hand-written requests and yielding highly accurate results with no data pollution; this is also the method most commonly used at Alibaba. Convenient as it is, it requires a sufficiently large volume of real requests, without which it cannot pinpoint the machine's precise bottleneck.
In load balancing adjustment, the weights on the load-balancing device are recalibrated so that a designated machine in the distributed environment receives a larger share of requests. This produces accurate results with no data pollution, but, as with request forwarding, it requires a large volume of requests within the distributed system to be effective.
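The weight-adjustment idea can be illustrated with a toy weighted-random balancer. The hostnames and weight values below are invented for illustration; real load balancers implement this with their own weighting schemes:

```python
import random

def pick_backend(weights):
    """Weighted random selection, as a load balancer would perform it.
    Raising one machine's weight concentrates real traffic on it."""
    return random.choices(list(weights), weights=list(weights.values()))[0]

# Normal operation: equal weights. For a stress test, raise one weight
# so the machine under test absorbs most of the production traffic.
normal  = {"app-01": 1, "app-02": 1, "app-03": 1}
testing = {"app-01": 8, "app-02": 1, "app-03": 1}  # app-01 under test

hits = sum(pick_backend(testing) == "app-01" for _ in range(10_000))
print(f"app-01 received roughly {hits / 100:.0f}% of requests")
```

With the weights above, app-01 receives about 80% of the traffic while the other two machines continue serving the remainder normally.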
Alibaba uses an automated platform based on these four methods to conduct scheduled or manually triggered stress tests, numbering over 5,000 a month. The platform detects system loads in real time, terminating the test whenever preset thresholds are breached. Stress tests also take place before system releases or major updates. With the estimates for request capacity and knowledge of the service system’s abilities in hand, the tech team efficiently calculates the approximate number of machines required.
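The threshold-guarded ramp that such a platform performs can be sketched as follows. The step size, threshold, and the toy load model are assumptions for illustration, not the platform's actual parameters:

```python
def ramp_stress_test(measure_load, step_qps=100, load_threshold=0.75, max_qps=10_000):
    """Increase traffic in steps; terminate as soon as the observed system
    load breaches the preset threshold, and report the last safe QPS."""
    qps = 0
    while qps < max_qps:
        next_qps = qps + step_qps
        if measure_load(next_qps) > load_threshold:   # real-time load check
            break                                      # threshold breached: stop
        qps = next_qps
    return qps  # estimated per-machine capacity

# Toy system whose load grows linearly with traffic (illustrative only)
capacity = ramp_stress_test(lambda q: q / 1000)
print(capacity)  # 700
```

The returned figure plays the role of the per-machine capacity estimate fed into the machine-count calculation described earlier.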
The team has learnt the hard way that the rough calculations provided by per-system machine stress testing may be a good starting point, but do not guarantee performance during high-traffic events. Back in 2012, when the clock struck twelve on the eve of Double 11, many systems performed worse than anticipated. The actual volume of online users and transactions was much higher than expected, and the interdependence of the systems compounded the issue. Faced with error pages, many customers were ultimately forced to abandon their carts.
Partly due to this experience, the Alibaba team likens Double 11 to a final exam at the end of an academic year – a test of the technical team’s preparedness. Following that logic, the team proposed holding mock exams that simulated the post-midnight traffic on Double 11 to assess the site’s capacity, performance, and bottlenecks. Thus, an ambitious new stress test platform to evaluate on a system-wide scope was developed.
A Double 11 simulation is a complex affair – almost a billion users coming online, browsing, and purchasing several million types of goods – and comprises three major features:
Issuing 10 million queries per second or more requires a flow control platform (deployed on thousands of worker nodes running internally developed test engines) that can generate a massive yet controllable volume of requests on demand. The simulation also needs natural user behavior, for which the team obtained basic data (buyers, sellers, commodities, promotions, etc.) and filtered it for use in a full-scale stress test. This basic data, combined with historical data from previous years, feeds the model used to simulate each year's Double 11. The model encompasses over a hundred business factors: the volume and types of buyers, sellers, and commodities, and their various interactions with each other and the site.
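For the traffic to be controllable, each worker node must pace its own output so that the aggregate rate can be dialed up or down on demand. A minimal pacing-loop sketch (the `send` callable and all rates are illustrative stand-ins, not the platform's real engine):

```python
import time

def issue_at_rate(send, target_qps, duration_s=1.0):
    """Sketch of one worker node's pacing loop: issue requests at a
    controlled rate so the platform's total output stays adjustable."""
    interval = 1.0 / target_qps
    sent = 0
    next_at = time.perf_counter()
    deadline = next_at + duration_s
    while next_at < deadline:
        send()                       # stand-in for a real request
        sent += 1
        next_at += interval
        delay = next_at - time.perf_counter()
        if delay > 0:
            time.sleep(delay)        # hold the pace if we are ahead
    return sent

# e.g. thousands of nodes each pacing their own share of the target QPS
n = issue_at_rate(lambda: None, target_qps=500, duration_s=0.2)
print(n)
```

Scaling the per-node rate across thousands of nodes is what allows the platform to reach an aggregate of 10 million requests per second while remaining tunable.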
Full-scale stress tests copy online data, which is first cleaned to remove any sensitive information and then used to generate simulated requests. Since simulated requests introduce a lot of dirty data into the production environment, distinguishing real requests from simulated ones becomes all the more important. This is achieved by storing the dirty data in an isolated shadow area and tagging each request so that every middleware protocol can recognize and handle it appropriately.
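The tag-and-shadow mechanism can be sketched as a routing decision made at the storage layer. The header name, table names, and prefix below are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical sketch: route tagged stress-test traffic to shadow tables
# so simulated requests never pollute production data.
SHADOW_PREFIX = "shadow_"

def resolve_table(request_headers, table):
    """If the request carries the stress-test tag (propagated by every
    middleware layer), read and write the isolated shadow copy instead."""
    if request_headers.get("X-Stress-Test") == "true":
        return SHADOW_PREFIX + table
    return table

print(resolve_table({"X-Stress-Test": "true"}, "orders"))  # shadow_orders
print(resolve_table({}, "orders"))                         # orders
```

Because the tag travels with the request through every middleware hop, each storage access can make this decision locally, keeping simulated and real data strictly separated.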
The full-scale stress test has been a resounding success – within the first year of initiating the new testing process, the team uncovered over 700 issues, with hundreds more in subsequent years.
Today, for each big promotion, several rounds of full-scale stress tests are undertaken to ensure a complete simulation of the whole system and reveal any hidden problems.
Capacity planning, even when founded on a meticulous, precise business model, is predictive in nature and thus prone to imprecision. A recent example is the 2016 Double 11, where capacity was prepared for an estimated peak of 142,000 requests per second, yet actual traffic exceeded that estimate by almost 13%, surpassing 160,000 requests per second on the day of the event. Pushed to the limits of their operating capacity, the machines processed requests more slowly, degrading the user experience; users resubmitted their search requests, creating a feedback loop that ultimately brought the system down.
When a single machine shuts down, standard load-balancing mechanisms redirect its queries to other, already saturated machines. This can cause those machines to become overloaded in turn, leading to an 'avalanche effect' that is difficult to contain.
This highlights the importance of flow control, not only during peak traffic periods but also when response times lengthen along network links. Beyond QPS, flow control thresholds can be based on response time and system load. Flow control methods can likewise be applied flexibly: requests can be abandoned or queued, and downstream applications can be downgraded or blacklisted.
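The accept-queue-reject decision described above can be sketched as a small admission controller. All limits and thresholds below are illustrative assumptions, not Alibaba's actual configuration:

```python
from collections import deque

class FlowController:
    """Sketch of flow control keyed on in-flight requests and response
    time: admit while under limits, queue a bounded backlog, then shed."""
    def __init__(self, max_inflight=100, max_queue=50, rt_limit_ms=200):
        self.inflight = 0
        self.queue = deque()
        self.max_inflight = max_inflight
        self.max_queue = max_queue
        self.rt_limit_ms = rt_limit_ms

    def admit(self, request, current_rt_ms):
        # Healthy response time and spare capacity: serve immediately.
        if current_rt_ms <= self.rt_limit_ms and self.inflight < self.max_inflight:
            self.inflight += 1
            return "accepted"
        # Over capacity but backlog has room: queue the request.
        if len(self.queue) < self.max_queue:
            self.queue.append(request)
            return "queued"
        return "rejected"   # abandoned under overload

fc = FlowController(max_inflight=1, max_queue=1)
print(fc.admit("r1", 50))   # accepted
print(fc.admit("r2", 50))   # queued (in-flight limit reached)
print(fc.admit("r3", 50))   # rejected
```

Queuing trades latency for success, while rejection sheds load outright; which to apply depends on how latency-sensitive the downstream application is.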
Flow control keeps a system safe, but the system is still running in a degraded state, which traditionally requires downtime for a full recovery. Conventional fixes are not ideal either, since they do not address the root causes of the overload. To achieve both fast recovery and stability, the team dynamically balances response time, load, and the permitted QPS rate, minimizing interruptions to the user experience.
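One common way to stabilize a system around a response-time target is an AIMD-style adaptive limit: raise the permitted QPS additively while response time is healthy, and cut it multiplicatively when it degrades. The specific constants below are illustrative, and this is a generic technique rather than Alibaba's exact mechanism:

```python
def adapt_qps_limit(current_limit, observed_rt_ms, target_rt_ms=100,
                    increase_step=50, decrease_factor=0.8, floor=100):
    """Adaptive throttling sketch: additively raise the permitted QPS
    while response time is healthy, multiplicatively cut it when the
    response time exceeds the target (AIMD-style policy)."""
    if observed_rt_ms <= target_rt_ms:
        return current_limit + increase_step
    return max(floor, int(current_limit * decrease_factor))

limit = 1000
limit = adapt_qps_limit(limit, 80)    # healthy RT: limit grows to 1050
limit = adapt_qps_limit(limit, 250)   # degraded RT: limit cut to 840
print(limit)  # 840
```

Run periodically against live metrics, a loop like this lets the permitted rate recover quickly after an overload without waiting for a manual restart.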
The full-scale stress test is a watershed in backend readiness for Double 11 and, along with flow capacity control, is an integral part of Alibaba’s preparedness arsenal. Through its full-scale stress test, the team was able to stabilize the system’s response to high-volume traffic, plan resource expansion more effectively, and drastically shorten recovery periods.
After several years of internal use, the full-scale stress test will be made available as a service on AliCloud come June 2018, allowing enterprises to easily verify the availability and capabilities of all web properties. The technology will also be presented by Aliware’s senior technical experts Zhang Jun (You Ji) and Lin Jia Liang (Zi Jin) at this year’s Asia/Australia leg of SREcon in Singapore.
. . .
First-hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook