Topic Title: Learning, Prediction and Optimization for Query Processing on Massive Computing Platform


Technical Area: Machine Learning and Big Data Processing



The massive computing platform MaxCompute (also known as ODPS) is one of the most critical pieces of infrastructure at Alibaba Group and is responsible for the largest portion of Alibaba’s computing business. MaxCompute processes millions of tasks/queries and petabytes of data every day, and both figures keep increasing with business growth. Although the performance of the MaxCompute platform has improved significantly over several rounds of upgrades, the tension between increasing business demand and expensive, limited computational resources is becoming more severe. This urges the Computing Platform Division to seek more effective, sustainable and automatic solutions to better allocate resources, predict task cost, and optimize overall system performance.


More specifically, the effectiveness and efficiency of the MaxCompute platform depend mainly on two factors. Firstly, the effectiveness of the generated distributed query plan depends on an accurate estimate of the data distribution and query complexity. For instance, it is hard to decide which join method to use without precise cardinality estimation. The computational complexity of UDFs (user-defined functions) is also not explicitly available, which means that partial ordering relation analysis and cost estimation can only be performed based on experience. Secondly, the efficiency of executing a distributed query plan depends on the fine-grained management of tasks and resources. Since every query in MaxCompute is converted into a complex DAG execution plan, the resource allocation for both the input nodes and the intermediate nodes of the DAG heavily affects the efficiency of the entire query processing pipeline, and can cause a serious performance bottleneck if any node is assigned fewer resources than it needs.
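To illustrate why cardinality estimation matters for join selection, here is a minimal sketch in Python. It is not MaxCompute's optimizer: the `TableEstimate` type, the `choose_join_method` function, the method names and the `BROADCAST_ROW_LIMIT` threshold are all hypothetical and chosen only for illustration. It assumes a simple rule in which the smaller input is broadcast when its estimated row count is under a limit, and a shuffle-based join is used otherwise.

```python
from dataclasses import dataclass

# Hypothetical threshold: inputs estimated at or below this many rows
# are assumed small enough to broadcast to every worker.
BROADCAST_ROW_LIMIT = 1_000_000


@dataclass
class TableEstimate:
    """Estimated cardinality of one join input (illustrative type)."""
    name: str
    estimated_rows: int


def choose_join_method(left: TableEstimate, right: TableEstimate,
                       broadcast_row_limit: int = BROADCAST_ROW_LIMIT) -> str:
    """Pick a join method from the estimated size of the smaller input."""
    smaller = min(left.estimated_rows, right.estimated_rows)
    if smaller <= broadcast_row_limit:
        return "broadcast_hash_join"
    return "shuffle_sort_merge_join"


# Same pair of tables, two different estimates for the smaller one.
orders = TableEstimate("orders", 5_000_000_000)
sellers_accurate = TableEstimate("sellers", 200_000)
sellers_overestimated = TableEstimate("sellers", 50_000_000)

print(choose_join_method(orders, sellers_accurate))
print(choose_join_method(orders, sellers_overestimated))
```

Under these assumptions, an overestimate of the smaller table's cardinality switches the plan from a broadcast join to a shuffle-based join, even though the underlying data has not changed. A real optimizer would also weigh row width, skew and available memory, which this sketch leaves out.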


The keys to the above two factors are accurate prediction of query complexity and global optimization of resource allocation. Although MaxCompute currently adopts simple optimization strategies for cost estimation and resource allocation, they depend largely on heuristics (e.g., data size), experience (e.g., a previous similar query) and manual settings. Therefore, the core problem is how to maximize the utility of computing and storage resources on such a large-scale platform (tens of thousands of nodes) with dynamic business requirements, so as to optimize overall system performance.



The MaxCompute platform has accumulated a large amount of historical data covering its entire query processing pipeline, such as data cardinality, query cost, allocated resources and execution statistics. This data provides a valuable opportunity to look deeper into the system internals and gain more insight into the complex relationships among data, queries and resources. We invite researchers from the database, data mining and system optimization areas to take advantage of this data and design more effective and intelligent mechanisms for resource prediction, allocation and optimization on our massive computing platform.


Researchers who propose a principled effort to investigate this problem are encouraged to bear in mind the following research objectives:



Related Research Topics

This project concerns learning, prediction and optimization for query processing, each of which has its own extensive line of research in the literature. A wide spectrum of techniques and methodologies has been proposed and is still evolving, spanning the areas of databases, data mining, machine learning and system optimization. However, we notice that current research is relatively scattered: learning, prediction and optimization techniques are not considered simultaneously or designed collectively, because researchers from different areas have different focuses. Thus the overall performance of the platform has considerable room for improvement.


We accept proposals on research topics related to one or more of the aforementioned research objectives, which include but are not limited to: