Title: Analytic Database based on GPUs

 

Technical Area: Database

 

Background

As data volumes grow, storage and computation in analytic databases face serious challenges. To save storage space and disk I/O, data should be compressed, but compression and decompression consume a lot of CPU time, which affects other running tasks. Some operations, such as join and aggregation, also consume many CPU cycles. In many scenarios, the CPU becomes the bottleneck. Meanwhile, Moore's Law is gradually ceasing to hold, and CPU development cannot keep up with the growth of data.

 

GPUs, as a newer class of hardware, offer greater computational capability than CPUs, and their computing power is increasing quickly. In academia, researchers have used GPUs to accelerate operations such as compression, join, and group by, reporting performance improvements of 10 times or more. In industry, several GPU databases have appeared over roughly the past five years, such as SQream, MapD, and BlazingDB.

 

Target

The goal of this topic is to use GPUs to improve computing performance in analytic databases, and to design a GPU-based database.

1. Storage format for GPU: Improve storage and IO performance.

The key study points include the storage module, smart indexes for GPU, smart metadata, and compression/decompression algorithms for GPU.

2. Resource management and scheduling for GPU: Improve GPU resource utilization.

This includes host/device memory management, GPU computing resource scheduling, and multi-GPU scheduling.

3. Algorithms

This includes a multi-column sort algorithm, join algorithms, and other database algorithms, all implemented to run on the GPU.

4. Optimizer for GPU database

Optimizers in existing databases generally do not consider new hardware such as GPUs. Because GPUs are the key hardware in a GPU database, the optimizer should take the effects of the GPU into account.
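As a rough illustration of what taking the GPU into account in an optimizer could mean, the Python sketch below compares an estimated CPU cost with an estimated GPU cost that includes host-to-device transfer and kernel launch overhead. Everything in it is an assumption made for illustration: the names `DeviceProfile`, `cpu_cost`, `gpu_cost`, and `choose_device` are hypothetical, the throughput figures are placeholders rather than measurements, and a real cost model would be considerably more detailed.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    # Hypothetical placeholder figures, not measurements of any real hardware.
    cpu_rows_per_sec: float = 5e7          # rows one operator processes per second on CPU
    gpu_rows_per_sec: float = 1e9          # rows the same operator processes per second on GPU
    pcie_bytes_per_sec: float = 1.2e10     # host-to-device transfer bandwidth
    gpu_launch_overhead_sec: float = 1e-4  # fixed cost of launching GPU work


def cpu_cost(rows: int, profile: DeviceProfile) -> float:
    """Estimated seconds to run the operator on the CPU."""
    return rows / profile.cpu_rows_per_sec


def gpu_cost(rows: int, row_bytes: int, profile: DeviceProfile,
             resident_on_gpu: bool = False) -> float:
    """Estimated seconds on the GPU, including transfer unless data is already there."""
    transfer = 0.0 if resident_on_gpu else rows * row_bytes / profile.pcie_bytes_per_sec
    compute = rows / profile.gpu_rows_per_sec
    return profile.gpu_launch_overhead_sec + transfer + compute


def choose_device(rows: int, row_bytes: int, profile: DeviceProfile,
                  resident_on_gpu: bool = False) -> str:
    """Pick the device with the lower estimated cost for this operator."""
    gpu = gpu_cost(rows, row_bytes, profile, resident_on_gpu)
    return "gpu" if gpu < cpu_cost(rows, profile) else "cpu"


if __name__ == "__main__":
    profile = DeviceProfile()
    for rows, row_bytes in [(1_000, 16), (100_000_000, 16), (100_000_000, 400)]:
        print(rows, row_bytes, choose_device(rows, row_bytes, profile))
```

Under these placeholder figures, small inputs stay on the CPU because launch overhead dominates, and wide rows can also stay on the CPU because transfer time outweighs the faster GPU compute unless the data is already resident in device memory.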

 

Related Research Topics

1. MapD, SQream, Brytlyt and BlazingDB. These are databases designed around GPUs. They provide high performance at low cost, but they do not support full SQL syntax. In addition, performance degrades when the data is too large. Larger data, higher concurrency, and more complex SQL therefore remain the more challenging cases.

 

2. Some papers propose algorithms to speed up join, group by, and order by. Most of them focus on the algorithm itself rather than on big data. When the data is huge, especially when it exceeds memory size and disk space, the open question is how to manage and schedule the data, device memory, and GPU.
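One common way to handle data larger than device memory is to process it in chunks sized to a memory budget and merge the partial results. The Python sketch below shows that pattern for a grouped sum. It runs entirely on the CPU: `aggregate_chunk` is a stand-in for a GPU kernel, and the names `chunk_rows`, `grouped_sum`, and `device_budget_bytes` are hypothetical. It is a sketch of the general idea under those assumptions, not the method of any paper or product mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # (group key, value)


def chunk_rows(rows: Iterable[Row], device_budget_bytes: int,
               row_bytes: int) -> Iterator[List[Row]]:
    """Split the input into chunks that fit the assumed device memory budget."""
    rows_per_chunk = max(1, device_budget_bytes // row_bytes)
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly smaller, chunk
        yield chunk


def aggregate_chunk(chunk: List[Row]) -> Dict[str, int]:
    """Stand-in for a GPU kernel: partial grouped sum over one chunk."""
    partial: Dict[str, int] = defaultdict(int)
    for key, value in chunk:
        partial[key] += value
    return dict(partial)


def grouped_sum(rows: Iterable[Row], device_budget_bytes: int,
                row_bytes: int) -> Dict[str, int]:
    """Aggregate chunk by chunk and merge the partial sums on the host."""
    total: Dict[str, int] = defaultdict(int)
    for chunk in chunk_rows(rows, device_budget_bytes, row_bytes):
        for key, partial_sum in aggregate_chunk(chunk).items():
            total[key] += partial_sum
    return dict(total)


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
    print(grouped_sum(data, device_budget_bytes=32, row_bytes=16))
```

Because sums can be merged across chunks, the result does not depend on the chunk size. Operations such as join or sort need more elaborate partitioning, which is where the scheduling questions raised above come in.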