Title: Fault Tolerance, Root-cause Analysis and Self-healing in Large-scale Distributed OLTP Systems


Technical Area: Intelligent Operation and Maintenance



Alibaba is the most popular destination for online shopping in China, running several hundred thousand servers to process trillions of web requests from end users. Due to the complexity of its business logic, Alibaba's engineering team adopted a micro-service architecture at a very early stage for scalability and separation of concerns.


A typical web request such as "place an order" or "add to the shopping cart" today requires the participation of more than 20 web services on the backend, and troubleshooting production issues or finding performance bottlenecks among these services is like finding a needle in a haystack.


Figure 1: Alibaba micro-service based service map in 2012. Without a proper tracing framework, system issues are extremely difficult to resolve.


Thus, we need traffic models along multiple dimensions, such as models of how traffic from different business scenarios interferes with each other, models of the traffic between nodes across different layers, and models of the relation between traffic and capacity.


In addition, hardware and software inherently have a certain error rate, which poses even greater challenges to the availability of large-scale enterprise platforms and SaaS.


Alibaba’s middleware team has built three systems to resolve these issues.

1.Distributed Tracing System:

By instrumenting middleware components (e.g. the Web/RPC framework and the database client), this system is able to record the execution path of web requests across multiple servers, helping developers at Alibaba to better diagnose issues in the production environment. By aggregating this data, we are able to reconstruct the "service map" of the infrastructure, allowing developers to quickly identify the performance bottleneck of the entire application.
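The aggregation step described above can be sketched as follows. This is only an illustration under stated assumptions: the span record layout (`trace_id`, `span_id`, `parent_id`, `service`, `duration_ms`) and the function name `build_service_map` are hypothetical and are not the format used by the tracing system itself.

```python
from collections import defaultdict


def build_service_map(spans):
    """Aggregate spans into caller -> callee edges with call count and mean latency.

    Each span is a dict with hypothetical keys: trace_id, span_id, parent_id,
    service, duration_ms. A span whose parent ran in a different service
    represents one cross-service call.
    """
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for s in spans:
        parent = by_id.get((s["trace_id"], s["parent_id"]))
        if parent is None or parent["service"] == s["service"]:
            continue  # root span, missing parent, or in-process call
        edge = stats[(parent["service"], s["service"])]
        edge["calls"] += 1
        edge["total_ms"] += s["duration_ms"]
    return {
        edge: {"calls": v["calls"], "avg_ms": v["total_ms"] / v["calls"]}
        for edge, v in stats.items()
    }


# Two illustrative traces: web -> order -> inventory, and web -> order.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "web", "duration_ms": 120.0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "order", "duration_ms": 80.0},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "inventory", "duration_ms": 50.0},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "web", "duration_ms": 60.0},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "service": "order", "duration_ms": 40.0},
]
service_map = build_service_map(spans)
```

Edges with a high call count or a high average latency are natural first candidates when looking for a bottleneck on such a map.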

2.Traffic model with multiple dimensions:

With the rapid development of the business, our system and the traffic flowing across it have grown enormously. However, we do not have a clear view of how this traffic flows through the system. We need a multi-dimensional traffic analysis model. Based on this model, we can dynamically adjust the traffic and make full use of system resources, isolate access points to remote services, and stop cascading failures in complex distributed systems.

3.Fault Injection System:

Fault injection is an effective approach to fault-tolerance testing that proactively injects failures into the infrastructure. The approach is already widely used in the aerospace, automotive and industrial sectors. In hardware and software, fault injection can quickly identify shallow bugs caused by a small number of independent faults, and it helps us verify the overall availability of the entire application.


These three systems have been up and running for years, and they have proved very useful in helping developers prevent, triage and eliminate issues in large-scale distributed systems. However, we still face many challenges:


1.The ultimate goal of our distributed tracing system is to identify the root cause of any PaaS-layer issue automatically. However:

1) At the moment, developers at Alibaba can only troubleshoot issues by sifting through the hundreds of diagrams we provide and trying to identify the culprit manually, which requires a significant amount of troubleshooting experience and prior knowledge of the particular system.

2) There is always a tradeoff between the granularity of the data we collect and the cost of processing and storing it. Storing the details of every single method call reveals more information about the system, but it means hundreds of PBs of tracing data to index and store every day, which is not feasible. Sampling and streaming data in a non-intrusive and "smart" way is the most challenging problem to resolve.

3) We have not yet leveraged unstructured data, such as application logs and application code.
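One common baseline for the sampling tradeoff in item 2) is deterministic head sampling keyed on the trace ID, so that every service reaches the same keep/drop decision for a given request without coordination. The sketch below is an assumption for illustration only; the function name `should_sample` and the choice of SHA-256 are hypothetical, and this simple scheme is not the "smart" sampling the challenge calls for.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Return True if the trace should be kept, for a sampling rate in [0, 1].

    The decision depends only on trace_id, so every server that sees the same
    request keeps or drops it consistently and traces are never half-recorded.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Keep roughly 10% of traces (the rate is an illustrative value).
kept = [tid for tid in (f"trace-{i}" for i in range(10000)) if should_sample(tid, 0.1)]
```

Because the threshold grows with the rate, any trace kept at a low rate is also kept at every higher rate, which makes it possible to raise the rate during an incident without losing continuity.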


2.Our goal is to have a clear view of our company’s traffic. However, we face the following challenges:

1) Our system is composed of hundreds of applications, and each application runs on thousands of machines. What makes the situation even more complex is that one system might carry a number of business scenarios. These business scenarios are related to one another through users’ behavior. For example, to successfully create an order, a user may need to complete a number of actions, such as browsing items and submitting the order.

2) There is no clear insight into the relation between the incoming traffic and the capacity of the distributed system. The traditional approach is to calculate the load of a single machine; however, this has two defects. First, it results in a jagged system, which is not conducive to stability. Second, the system capacity is often determined by the bottleneck among all the involved applications, which is hard to detect.
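The second defect in item 2) can be made concrete with a small calculation: the capacity of a business scenario is bounded by the application that saturates first, not by any single machine. The sketch below assumes, purely for illustration, that each application has a known total capacity and a fixed number of calls per end-user request; the function name and all numbers are hypothetical.

```python
def scenario_capacity(per_app_capacity_qps, calls_per_request):
    """Return (max end-user requests per second, bottleneck application).

    per_app_capacity_qps: total calls per second each application can serve.
    calls_per_request: how many calls one end-user request makes to each
    application. Both inputs are simplifying assumptions (linear scaling,
    fixed fan-out).
    """
    limits = {
        app: per_app_capacity_qps[app] / calls
        for app, calls in calls_per_request.items()
        if calls > 0
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical numbers: inventory has more raw capacity than order,
# but is called twice per request and therefore saturates first.
capacity, bottleneck = scenario_capacity(
    {"web": 50000, "order": 20000, "inventory": 30000},
    {"web": 1, "order": 1, "inventory": 2},
)
```

In this example the scenario tops out at 15000 requests per second because of the inventory application, even though its raw capacity is higher than that of the order application.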


3.Fault injection is difficult to apply to distributed systems because of the complex combinations of multiple instances and fault types. We currently encounter the following problems:

1) In a distributed system, the number of fault-testing paths can expand dramatically, which makes complete coverage hard to achieve. The following is a sample:

2) Random fault injection may bring uncertainty to the business, and its impact is hard to measure accurately and universally.

3) Complex test paths take a long time to test, so we need to automate the testing.

4) The availability of a single component does not represent global availability, yet it is global availability that we need to ensure.
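To give a rough sense of the growth described in item 1), the number of distinct fault scenarios can be counted under a simplified model in which any subset of up to k fault points fails at once, each with one of a fixed set of fault types. The model, the function name and the numbers below are illustrative assumptions, not measurements of our system.

```python
from math import comb


def fault_scenarios(fault_points: int, fault_types: int, max_concurrent: int) -> int:
    """Count scenarios with 1..max_concurrent simultaneous faults.

    For i simultaneous faults there are C(fault_points, i) ways to pick the
    failing points and fault_types**i ways to assign a fault type to each.
    """
    return sum(
        comb(fault_points, i) * fault_types**i
        for i in range(1, max_concurrent + 1)
    )


# 20 fault points (one per backend service on a request path) and
# 3 hypothetical fault types such as delay, error and crash.
single = fault_scenarios(20, 3, 1)   # 60
double = fault_scenarios(20, 3, 2)   # 1770
triple = fault_scenarios(20, 3, 3)   # 32550
```

Even this small model grows from 60 to 32550 scenarios when moving from one to three simultaneous faults, which is why exhaustive testing is impractical and the search space has to be pruned.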



1.The objectives of the research on the Distributed Tracing System topic include:

1) A non-intrusive, low-overhead approach to detecting anomalies/outliers in traces. This approach shall not require prior knowledge of the application code.

2) Use machine learning techniques to correlate the hundreds of types of metrics, traces, logs and application code collected by the system, and triage the root cause in a more proactive way.
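As a point of reference for objective 1), a very simple statistical baseline that needs no knowledge of the application code is a robust outlier test on span durations, using the median and the median absolute deviation (MAD). This is only a sketch of one well-known baseline, not the approach the research is expected to deliver; the function name and sample data are hypothetical, and the 0.6745 constant and 3.5 threshold follow the usual modified z-score convention.

```python
from statistics import median


def latency_outliers(durations_ms, threshold=3.5):
    """Return indices of durations whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. Median and MAD are
    used instead of mean and standard deviation because they are not skewed
    by the very outliers being searched for.
    """
    med = median(durations_ms)
    mad = median(abs(d - med) for d in durations_ms)
    if mad == 0:
        return []  # no spread to measure against; report nothing
    return [
        i for i, d in enumerate(durations_ms)
        if 0.6745 * abs(d - med) / mad > threshold
    ]


# Hypothetical durations of the same RPC across eight traces.
outliers = latency_outliers([10, 12, 11, 13, 12, 11, 250, 12])
```

Here only the seventh sample (index 6, 250 ms) is flagged. A baseline like this looks at one metric at a time, whereas objective 2) asks for correlation across metrics, traces, logs and code.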


2.The objectives of the research on the Traffic Model with Multiple Dimensions topic include:

1) A self-adaptive approach to profiling traffic along multiple dimensions. The profiling should be flexible and easy to adapt.

2) Real-time isolation of the access points to failing remote systems, services and 3rd party libraries, in order to stop cascading failures and enable dynamic adjustment.
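The isolation in objective 2) is commonly implemented with a circuit breaker placed around each remote access point. The following is a minimal sketch under stated assumptions: the class names, thresholds and the injectable `clock` parameter are hypothetical, and the code is single-threaded and far simpler than a production library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("circuit open; call rejected")
            # Otherwise half-open: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None       # a success closes the circuit
        return result
```

Rejecting calls immediately while the circuit is open keeps the caller's threads from piling up behind a failing dependency, which is how the failure is prevented from cascading upstream.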


3.The objectives of the research on the Fault Tolerance/Fault Injection topic include:

1) An easy-to-follow fault injection model that can be readily constructed in practice (taking into account the programming language, operating system, etc.).

2) Distributed dependencies inflate the test space. We need efficient algorithms to narrow down the search space, and also to trace back paths to assist in locating problems.

3) Systematize the fault injection process. It should automatically discover all service entries and automatically learn the new fault points caused by cascading effects after a fault is injected.


Related Research Topics

1.Distributed Tracing System:

1) Canopy: An End-to-End Performance Tracing and Analysis System. Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song. Symposium on Operating Systems Principles (SOSP).

2) Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach. Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan. To appear in the 26th ACM Symposium on Operating Systems Principles (SOSP’17), October 2017, Shanghai, China.


2.Traffic Model with Multiple Dimensions:

1) Hystrix: a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

2) Istio: controls traffic between services with dynamic route configuration, and supports conducting A/B tests, releasing canaries, and gradually upgrading versions using red/black deployments.


3.Fault injection can be traced back to the 1970s, when it was used to simulate hardware faults (HWIFI, Hardware Implemented Fault Injection). It was then gradually adopted at the software level (SWIFI, Software Implemented Fault Injection).

1) The classic paper on fault injection is NASA's "Fault injection techniques and tools" from 1997, which first defined the structure of a fault injection system.

2) Peter Alvaro, Associate Professor at the University of California, Santa Cruz, published the paper "Lineage-driven Fault Injection" in 2015. It describes a prototype that can efficiently test the robustness of fault-tolerant protocols. The prototype was used to verify 14 commonly used fault-tolerance protocols and found 7 bugs.