**Topic Title: **Interactive Data Analytics via Approximations

** **

**Technical Area:**

- Approximate Query Processing
- OLAP, Sampling, Indexing
- Streaming, Distributed System

** **

**Background**

Repositories and computational platforms of Big Data open the possibility of gaining unprecedented insights, at the price of increased need for computational resources (and/or intolerable latency) for processing queries and analytics over *voluminous*, *high-speed*, and *heterogeneous *data. It is appealing to enable our platform to provide approximate answers within interactive response time or, at least, at a small fraction of the cost of processing queries or analytics in the traditional ways. There are several scenarios where such abilities are critical:

**1.Interactive Data Exploration.**

For example, data analysts in Taobao slice and dice their sales data to understand the sales performance along different dimensions like brand and geographic location with a varying set of filters. Interactive query response time is important here as which query to be issued next may depend on the result of the current query: studies in human-computer interaction show that the analyst typically loses the analysis context if the response time is above one second.

**2.Time-sensitive Analytics with Limited Computational Resources.**

When data is refreshing at a high speed (e.g., every hour) and varying set of queries need to be to answered over the fresh data (e.g., the most recent 48 hours), it is challenging to finish the data processing (building, e.g., indexes and data cubes) in a timely way and answer online queries precisely. This issue becomes more pressing if the computational resource is limited.

**3.Approximation is Enough.**

Some analytical tasks are based on probabilistic models and approximations by themselves, for example, graph embedding. Sampling and approximation techniques can be naturally incorporated into them to have smooth tradeoff between interactivity and precision.

In each (or combinations) of the above scenarios, approximate but interactive analytics could prove effective in helping data scientists identify the subset of data that needs further drill-down, discovering trends, enabling fast visualization, and so on. Approximation techniques for data analytics can be characterized by the following dimensions when to be integrated in a computational platform: i) the generality of the query/analytics supported, ii) error model and accuracy guarantee, iii) the amount of work saved at runtime, iv) the amount of additional resources it requires in precomputation, and v) how easy the integration is.

However, even after decades of research, approximate query processing remains largely confined to academic research and is not a well-established paradigm in today’s products and services. One seemingly obvious reason is that we are unable to build a solution which is satisfactory in all the five dimensions above. On the other hand, different applications routinely leverage application-specific approximations. It is time to re-think under which scenarios and what combinations of the five dimensions will make it possible for applications to find platform-level solution for approximate queries and analytics attractive.

We invite researchers from related domains (e.g., approximate query processing, streaming, distributed database systems, and algorithms) to pursue a principled line of scientific efforts with the goal of deploying approximation-based interactive data analytics techniques into our real large-scale scenarios.

**Target**

We propose a pipeline of targets including both short-term and long-term ones.

In the first stage, the goal is to build and expose basic sampling operators to users, with well-defined mathematical semantics that are proper for different types of applications (e.g., OLAP vs. ML), high performance (considering the distributed environment in MaxCompute), usability, and flexibility (able to be inserted at any position of the analytical pipeline or query plan). These operators will also be useful for the second stage when we build platform-level solutions. There are challenges from both system and algorithmic aspects.

In the second stage, the goal is to build scenario-oriented platform features in order to enable interactive data analytics via approximations (with precision guarantees) for each scenario of interests. Some of such scenarios include: OLAP on wide tables or snowflake schemas, explorative queries on data streams, and analytics on large graphs. We will employ sampling operators developed in the first stage, and even more, e.g., indexes, sketches, and data cubes.

**Related Research Topics**

There are a lot of open research questions in both stages.

* *

*Efficient sampling operators in distributed storage *(1st stage).

When data is stored block by block in a distributed manner, it is challenging to implement traditional sampling algorithms efficiently, e.g., reservoir sampling, stratified sampling, and weighted sampling, meanwhile maintaining (approximately) the same statistical properties. Whether block-level sampling is possible, how much it hurts the promised statistical properties of samples, and how to expose them to users.

* *

*Middleware solution vs. deep integration *(1st & 2nd stage).

Whether we should pursue a middleware solution or a deep-integration framework. Ideally, if a computational platform expose to us a sufficiently large set of operators/APIs, it is possible use them to build approximation views, e.g., random samples, data cubes, and indexes, and rewrite queries in such a way that they be executed on these views. How large does this set need to be? What’s the overhead?

* *

*Accuracy requirements in different scenarios *(1st & 2nd stage).

In different application scenarios, users may have different requirements for “accuracy”. For example, in histograms (answers to aggregation queries), some users care about the overall deviation of distributions, some care about the order of bars, some care about the top-k bars, and some may require no bar is missing. Can we let users specify the error model in their queries (and how?), and choose proper samplers and query plans to minimize the error they care about?

** **

*Adaptive approximation in stream processing *(2nd stage).

When online queries are processed over data streams, there are two parameters: i) resource budget, and ii) throughput. We can add another dimension, iii) precision. While ii) is varying in an unpredictable way, can we have an adaptive tradeoff between i) and iii). For example, when the incoming throughput is increasing dramatically, with the current resource, can the system automatically switches to report approximate answers with the best possible precision? If more resource is available later on, can we recover smoothly from approximate answers back to precise answers?

* *

*Approximate graph analytics *(2nd stage).

Analytics on graphs such as vertex embedding and other random-walk based statistics are essentially computed from probabilistic models. Samples (i.e., random subgraphs in this context) can be used to estimate these statistics under such models. Is it possible to have a smooth tradeoff between the precision and the sample size for graph analytical tasks? For graph queries (e.g., the distance between two vertices), is it possible to have a compact sketch such that each query can be answered efficiently (e.g., in constant time) and approximately (e.g., with precision guarantees)?