Title: Research on failure prediction and proactive fault tolerance in large-scale data centers and storage systems

 

Technical Area: Intelligent Operation and Maintenance

 

Background

Nowadays, data centers are usually composed of a large number of servers and clusters with tens of thousands of high-capacity HDDs and SSDs, so as to provide storage for fast-growing data volumes. Reliable IT infrastructure is critical in large-scale data centers and cloud computing environments, yet disk failures in data centers are prevalent in practice. Thus, how to store data reliably is one of the most fundamental requirements faced by data centers.

 

Predicting drive failures and sector errors before they actually occur allows us to handle them in advance, which can greatly enhance storage system reliability and avoid performance degradation. For example, once a disk is predicted to fail, the storage system can replace it in advance, thereby reducing the probability of data loss and avoiding massive data transfers during failure recovery. Most existing work predicts disk failures and sector errors based on Self-Monitoring, Analysis and Reporting Technology (SMART) [2], which monitors internal attributes of individual drives and properly reflects their health status. Over the last two decades, a number of statistical and machine learning approaches have been proposed to build failure prediction models on SMART attributes, most of which fall into the following categories: 1) whole-disk failure prediction [3][4]; 2) latent sector error prediction [5]; 3) SSD error prediction [5]; 4) proactive fault tolerance [6][7].
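As a concrete illustration, the following sketch trains an offline whole-disk failure classifier on a few commonly used SMART counters (reallocated sectors, reported uncorrectable errors, and pending sectors). The CSV path, column names, and model choice are illustrative assumptions, not part of this proposal.

```python
# Minimal sketch of an offline SMART-based failure classifier.
# Assumes a Backblaze-style CSV of per-drive daily records; the column
# names (smart_5_raw, smart_187_raw, ...) and file path are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

FEATURES = ["smart_5_raw", "smart_187_raw", "smart_197_raw", "smart_198_raw"]

df = pd.read_csv("smart_records.csv")          # hypothetical input file
df = df.dropna(subset=FEATURES + ["failure"])  # keep complete records only

X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df["failure"], test_size=0.3, stratify=df["failure"])

# Class weighting partially compensates for the heavy imbalance:
# failed drives are a tiny fraction of all records.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```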

 

Although there have been many studies in the literature on failure prediction, we argue that state-of-the-art prediction approaches do not properly conform to the requirements of practical deployments. First, most existing studies only support offline prediction, using all the historical data to predict one failure event, while practical storage systems actually prefer online prediction, which iteratively collects newly experienced failure events and retrains the model for more accurate prediction in subsequent observation periods. Second, failure correlations among different factors are still largely unexplored. These correlations can indicate how much impact each factor has on failure events and how different factors may co-occur. Once learned, these correlations not only help construct more accurate prediction models, but also facilitate the design of efficient proactive fault tolerance mechanisms. Third, many commodity storage systems have extensively adopted erasure coding to provide fault tolerance at the system level, while many proactive fault tolerance mechanisms still rely on replication. Replication incurs more storage overhead, but it introduces less repair traffic (i.e., the amount of data read for recovery) than erasure coding. How to elastically integrate both replication and erasure coding in proactive fault tolerance design, so as to fully leverage the storage efficiency of erasure coding while reducing repair traffic at the same time, remains an open issue.
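As a back-of-the-envelope illustration of this trade-off, the sketch below compares 3-way replication with a Reed-Solomon RS(6,3) code under conventional repair; the parameters are illustrative, not prescribed by this proposal.

```python
# Back-of-the-envelope comparison of 3-way replication vs. RS(6,3)
# erasure coding, illustrating the trade-off described above.
# Numbers are per failed data block under conventional repair.

def replication(copies: int):
    storage_overhead = copies          # every block is stored `copies` times
    repair_traffic = 1                 # read one surviving replica
    return storage_overhead, repair_traffic

def reed_solomon(k: int, m: int):
    storage_overhead = (k + m) / k     # k data blocks plus m parity blocks
    repair_traffic = k                 # conventional repair reads k blocks
    return storage_overhead, repair_traffic

print("3-replication:", replication(3))     # (3, 1)
print("RS(6,3):      ", reed_solomon(6, 3)) # (1.5, 6)
```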

 

To this end, the following three key issues in failure prediction and proactive fault tolerance should be studied. First, design online learning algorithms that iteratively retrain the model by taking newly collected events as input and output prediction results in a timely manner. Second, investigate failure correlations among different factors; the results will help design more accurate prediction models by amplifying the impact of the most correlated factors and lowering the impact of the least correlated ones. Third, study a new proactive fault tolerance mechanism based on both replication and erasure coding, so as to balance storage overhead and recovery performance.

 

Target

The targets of this research topic are listed as follows:

1. Online Learning Algorithms

2. Measurement Analysis on Failure Correlation

3. Proactive Fault Tolerance

 

Based on the aforementioned goals, the Alibaba team can properly and accurately profile and evaluate the health status of large-scale storage systems, as well as servers and clusters in data centers.

 

Related Research Topics

1. Online Learning Algorithms

Online learning algorithms should be inexpensive and convenient. Since any single disk rarely fails within an observation period, building a health model that estimates the health degree of each disk has a higher priority than directly predicting whether a disk will fail or not.

 

The online learning model should also be adaptive, so that it can be updated based on feedback.
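A minimal sketch of such an adaptive model is shown below, assuming SMART feature batches arrive once per observation period and using scikit-learn's incremental partial_fit interface; the batch format and the update/health_degree helpers are illustrative assumptions.

```python
# Minimal sketch of online (incremental) retraining, assuming SMART feature
# batches arrive per observation period; the batch format is illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

model = SGDClassifier(loss="log_loss", class_weight="balanced")
scaler = StandardScaler()
classes = np.array([0, 1])  # 0 = healthy, 1 = failed

def update(X_batch, y_batch, first_batch=False):
    """Incrementally refit the model on a new observation period."""
    Xs = scaler.partial_fit(X_batch).transform(X_batch)
    if first_batch:
        model.partial_fit(Xs, y_batch, classes=classes)
    else:
        model.partial_fit(Xs, y_batch)

def health_degree(X_now):
    """Health degree in [0, 1]; lower means closer to failure."""
    return 1.0 - model.predict_proba(scaler.transform(X_now))[:, 1]
```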

 

2. Measurement Analysis on Failure Correlation

Measurement analysis on failure correlation will make the model more explainable. There are two kinds of factors: internal factors and external factors. Internal factors often refer to the inherent manufacturing design of a disk and the aging process of its internal components. In contrast, external factors are usually determined by the environment in which the disk is hosted and the workload placed on the disk. Correlation coefficients and related methods should be applied to conduct a systematic measurement that uncovers the correlations between failures and different factors in the systems.
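As a sketch of such a measurement, the snippet below rank-correlates a few candidate internal and external factors with the failure label using the Spearman coefficient; the input file and column names are illustrative assumptions rather than a prescribed schema.

```python
# Illustrative measurement of failure correlation: rank-correlate each
# candidate factor (internal SMART counters, external temperature/workload)
# with the failure label. File and column names are assumed for illustration.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("disk_observations.csv")   # hypothetical measurement table
factors = ["smart_5_raw", "smart_197_raw",  # internal: reallocated / pending sectors
           "temperature_c", "daily_io_gb"]  # external: environment / workload

for f in factors:
    rho, pval = spearmanr(df[f], df["failure"])
    print(f"{f:>15}: rho = {rho:+.3f}, p = {pval:.3g}")
```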

 

3. Proactive Fault Tolerance

Special attention should also be paid to avoiding data loss on a soon-to-fail disk. Since it is difficult to pinpoint the exact time of a disk or server failure, designing data protection solutions, such as erasure coding, on top of accurate failure prediction is very useful.
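One possible shape of such a solution is sketched below: once the predicted failure probability of a disk crosses a threshold, its chunks are drained to healthy disks before the failure actually occurs. The thresholds, the Disk structure, and the migrate_chunk callback are hypothetical placeholders, not an existing system API.

```python
# Hedged sketch of a proactive protection policy: drain chunks off disks
# whose predicted failure probability exceeds a threshold. Disk, thresholds,
# and migrate_chunk are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List

WARN_THRESHOLD = 0.5    # start background migration
URGENT_THRESHOLD = 0.8  # migrate at high priority

@dataclass
class Disk:
    disk_id: str
    fail_prob: float     # output of the failure prediction model
    chunks: List[str]    # identifiers of chunks resident on this disk

def proactive_protect(disks: List[Disk],
                      migrate_chunk: Callable[[str, Disk, str], None]) -> None:
    for disk in disks:
        if disk.fail_prob < WARN_THRESHOLD:
            continue  # disk looks healthy enough; leave its data in place
        priority = "high" if disk.fail_prob >= URGENT_THRESHOLD else "background"
        for chunk in disk.chunks:
            migrate_chunk(chunk, disk, priority)
```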

 

In order to make such a solution measurable before deployment into a production environment, trace-driven simulations should be designed to conduct "what-if" analysis and justify the effectiveness of proactive fault tolerance based on real traces.
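A minimal "what-if" sketch along these lines is given below: it replays a failure trace and estimates how much reactive erasure-coded repair traffic would be avoided if flagged disks were proactively drained. The trace format and the RS(6,3) repair assumption are illustrative, not taken from a real trace.

```python
# Minimal trace-driven "what-if" sketch: replay a failure trace and estimate
# how much reactive repair traffic proactive migration would avoid. The
# event format (disk_id, capacity_gb, predicted) is assumed for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class FailureEvent:
    disk_id: str
    capacity_gb: float   # data resident on the disk when it fails
    predicted: bool      # did the model flag this disk early enough?

def what_if(trace: List[FailureEvent], rs_k: int = 6):
    # Conventional RS repair reads rs_k blocks per lost block; proactive
    # handling only copies the resident data off the flagged disk.
    reactive_gb = sum(e.capacity_gb * rs_k for e in trace if not e.predicted)
    proactive_gb = sum(e.capacity_gb for e in trace if e.predicted)
    return {"reactive_repair_gb": reactive_gb,
            "proactive_migration_gb": proactive_gb}

trace = [FailureEvent("d1", 4000, True), FailureEvent("d2", 4000, False)]
print(what_if(trace))
```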