Topic Title: Progressive OLAP Engines for Batch Data Analysis
Technical Area: OLAP Query Processing
Over the last few years, there has been a significant increase in the amount of data that is being collected and analyzed every day to support various data-driven decisions. A major challenge in processing these massive amounts of data comes from the fact that the underlying hardware cannot get fast or cheap quickly enough to keep up with the data growth, which in turn means that cost of decision making will keep increasing. As OLAP queries usually touch a significant amount of data, this mismatch between data growth and computation power shows poses significant challenges to the traditional batch execution model of OLAP engines.
For instance, as a large-scale data processing platform, MaxCompute (a.k.a ODPS) is one of the most critical infrastructures at Alibaba Group. MaxCompute serves the majority of Alibaba’s daily data analysis jobs. A large portion of these data analysis jobs are recurring jobs for mission-critical business reporting and analysis, and have stringent deadline SLO. However, as most of these jobs are scheduled at the end of a day/week/month, sudden peak usage of the cluster put much pressure on the computation resources. Moreover, as data analysis demands are ever increasing, making the deadline SLO become more and more challenging. Our analysis shows that such inefficient cluster usage pattern is largely due to the limitation of the batch execution model of existing OLAP engines. Actually, the huge amount of input data for these analysis jobs are usually accumulated through a long period of time, and a substantial part of the data is made available for analysis pretty early. However, the analysis jobs have to wait until all the input data is collected and then start execution, leaving the cluster resources underutilized outside the peak houses of a day.
To tackle the above-mentioned challenges, adopting the progressive query execution model in OLAP engines can be a very promising approach. Different from the traditional OLAP processing model, a progressive OLAP engine can start processing a query even with partial input data. As more input data is available, the query engine carries out the computation progressively, and can pause and resume the execution multiple times. And eventually, when all the input data is processed, the accurate query results are delivered as if the input data was processed all at one shot. This progressive execution model can enable MaxCompute evenly distribute the computation temporally, accordingly to the progress that the input data is collected and accumulated, and help the scheduler fully utilize the cluster resources, alleviating the deadline pressure.
We propose to systematically explore the direction of incorporating the progressive query execution model into the MaxCompute platform. We foresee that this new query execution model will require a re-thinking of the design of many fundamental components of data processing platforms, ranging from the query planner, the execution runtime, the job scheduler, to the caching and storage system.
We invite researchers from database and distributed systems areas to join this effort. We accept proposals of research topics related to one or multiple fore-mentioned research objectives:
- New query planner designs that take into consideration of the fact that some input data is made available incrementally, and generate progressive query plans.
- New schedulers that are aware of the temporal aspect of a job’s readiness, and schedules different phases of progressive jobs accordingly.
- New execution runtimes that can checkpoint and resume their executions efficiently.
- New caching/storage systems that are designed and optimized that support efficient checkpoint/management/restoring of the intermediate running states of the execution runtimes.
Related Research Topics
This project is related to several research topics to the fields of database and distributed systems, including online aggregation, approximate query processing, and delta view maintenance.
There is substantial work on online aggregation and approximate query processing. Existing work in both research topics adopts a methodology of computing approximate query results based on partial data, and focus on optimizing the query accuracy given a time for query processing or space budget for storing samples. Differently, progressive query execution does not require delivering approximate query results based on partial results. Instead it focuses more on incremental execution when more data is available.
Incremental view maintenance (IVM) is a very important topic in database view management, and has been studied for over three decades. IVM focuses on a similar problem—computing a delta update query when the input data is updated. However, majority of work in this area only tackles simple SPJA queries without nested and correlated aggregation subqueries. However, progressive query execution is looking for a more general solution that can be used for general-purpose large-scale data processing platforms.