Topic Title: Auto Cloud-Scale Data Warehousing


Technical Area: Machine Learning and Big Data Processing



MaxCompute, also known as ODPS, is the main data warehouse infrastructure in Alibaba Group. It serves the majority of Alibaba’s data analysis need, hosting over 5 million tables, and running over 5 million jobs daily, and these numbers keep growing rapidly. It is not surprising at all that the management work of such huge amount of data and jobs are highly challenging, ranging from cleaning up raw data, tightening data privacy, managing and storing data, developing analysis pipelines for business requests, to budgeting resources for different analysis applications.


However, despite the scale and complexity, this management work is still carried out by humans, lacking global coordination, and heavily relying on experiences.


Automating the data warehouse administration work is a long studied research area for traditional RDBMS. The rich literature in this field show very promising results how such automation can help improve the efficiency of data processing and resource usage.


However, a cloud-sized data warehouse such as MaxCompute poses new challenges and opportunities, which are very different than traditional data warehousing automation. The differences are mainly three-fold:

1) Due to the sheer size of the data managed, a cloud-sized data warehouse is usually geo-distributed. The space of physical design has very different trade-off dimensions (such as data locality and replication), and has very different set of tuning knots (e.g., global secondary indexes are generally more expensive in clouds).

2) The shared nature of clouds determines that the tables and jobs are from independent users/groups lacking of coordination. However, optimizing the data warehouse globally can bring way more benefits than local optimizations.

3) The large amount of recurring business reporting jobs brings the temporal dimension into the optimization space. Many optimizations have to consider temporal trade-offs. For example, such as materializing views to share computation has to consider when to materialize and how long to keep the view.



We propose the topic of auto cloud-scale data warehousing, and invites researchers from different fields to join our effort. We accept proposals including but not limiting to the following topics:

1) Geo-distributed data placement and replicate strategies.

2) Indexes selecting.

3) Common Sub-Graphs matching and combining in data warehouses.

4) Fields relationships mining for table corpus.


Related Research Topics


Indexes suggestion: Clustering indexes are supported by MaxCompute to reduce the task instances in shuffle stages. A clustering index can be made of several fields, but only one clustering index can be defined for one table. Considering the cases that a table joins with many other tables on different fields, it is important to build a cost model and figure out the best definition of the clustering indexes for these tables.


Common sub-graphs matching and combining: As mentioned above, there are strong relationships between fields and tables in MaxCompute. If we draw these dependencies in one paper, it would be a directed acyclic graph (DAG), in which the nodes describe the operators from one field of table A to another field of table B. We call this the blood relationships (BRs) between fields or tables. If we do some deep dives in the Logical Operators Tree (LOT) of all queries while considering the BRs of fields and tables, we may found there are lots of duplicated or similar computations in our system. If we create materialize views for these common sub-graphs and rewrite queries in the optimizer automatically, there would be large spaces to reduce the resources requirements and make the queries running more efficiently.


Relationships mining for new datasets: We mine the relationships between fields of different tables to help our users auto-cleaning, auto-normalizing, auto-correcting the data and building their data models before they write queries, especially for the unstructured data in operational data store (ODS) layer, such as web-tables and click streams. These relationships mining works lay on only the schema and data itself of tables, not the users’ queries.