Topic Title: Towards Highly Efficient Memory Management for Distributed Big Data Processing


Technical Area: Memory Management of Big Data Processing


Big data processing systems that have emerged recently, such as Spark, can process huge volumes of data in a scale-out fashion. Unlike traditional database systems, which use declarative query languages and relational (or multidimensional) data models, these systems allow users to implement application logic through User Defined Functions (UDFs) and User Defined Types (UDTs) in high-level imperative languages (such as Java, Scala, and C#), which can then be automatically parallelized across a large-scale cluster.
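As a local, hedged illustration of this UDF-centric style, the following sketch uses plain Java streams; in a real system such as Spark, the same lambdas would be shipped to executors and applied to data partitions in parallel (the class and method names here are illustrative, not any framework's API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UdfStyle {
    // The lambdas below play the role of UDFs: ordinary imperative-language
    // code that a data-parallel framework would distribute across a cluster.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(List.of("big data", "big memory"));
        System.out.println(counts.get("big")); // prints 2
    }
}
```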


Existing research on these systems mostly focuses on scalability and fault tolerance in a distributed environment. However, recent studies suggest that the execution efficiency of individual tasks in these systems is low. A major reason is that both the execution frameworks and the user programs of these systems are implemented in high-level imperative languages running on managed runtime platforms (such as the JVM and the .NET CLR). These platforms commonly provide built-in automatic memory management, which introduces significant memory and CPU overheads. For example, modern tracing-based garbage collectors (GCs) may consume a large number of CPU cycles tracing live objects in the heap.
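The GC cost described above can be observed directly through the JVM's standard management interface. The sketch below sums the cumulative collection time reported by all collectors; a profiling harness could sample this before and after a task to attribute CPU time to GC:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCost {
    // Total milliseconds this JVM has spent in GC so far, summed over all
    // registered collectors (e.g. young-generation and old-generation GCs).
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("GC time so far: " + totalGcMillis() + " ms");
    }
}
```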


Furthermore, to improve the performance of multi-stage and iterative computations, recently developed systems support caching intermediate data in main memory and exploit eager combining and aggregation of data in the shuffle phase. These techniques generate massive numbers of long-lived data objects in the heap, which typically stay in memory for a significant portion of the job's execution time. Continuously and needlessly tracing and marking such a large number of long-lived objects consumes significant GC CPU cycles.
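Eager combining in the shuffle phase can be sketched as map-side pre-aggregation: partial results are merged into a hash map before any data is sent over the network. This is a simplification of what real combiners do, but it shows where the long-lived heap objects come from:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideCombine {
    // Combine (key, value) pairs eagerly before shuffling. The HashMap and its
    // entries live until the shuffle completes -- exactly the kind of
    // long-lived objects that a tracing GC repeatedly scans.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            combined.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = combine(List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)));
        System.out.println(out.get("a")); // prints 4
    }
}
```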




The inefficiency of memory management in managed runtime platforms for big data processing systems has been widely acknowledged. Existing efforts to minimize the overhead of memory management can be categorized into the following directions:



1. In-Memory Serialization: Many distributed data processing systems, such as Hadoop, Spark, and Flink, support serializing in-memory data objects into byte arrays in order to reduce memory management overhead. However, object serialization and deserialization have long been known to incur high overhead.
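A minimal sketch of in-memory serialization using the standard Java mechanism (production systems typically use faster serializers such as Kryo): many small heap objects are replaced by a single byte array the GC can skip over cheaply, but every access then requires a deserialization pass, and the serialized form carries stream and class-descriptor overhead beyond the raw payload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerDe {
    // Serialize an object graph into one byte array.
    public static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Every read of serialized data pays this deserialization cost.
    public static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new int[]{1, 2, 3});
        int[] back = (int[]) deserialize(bytes);
        // The payload is 12 bytes of ints, but the serialized form is larger
        // due to the stream header and class descriptor.
        System.out.println(bytes.length + " bytes on the wire, back[2]=" + back[2]);
    }
}
```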


2. GC Optimization: Most traditional GC tuning techniques target long-running, latency-sensitive web servers rather than throughput-oriented big data processing. Implementing better GC algorithms is another line of work, but because these approaches require modified JVMs, they are difficult to adopt in production environments.


3. Region-Based Memory Management (RBMM): In RBMM, all objects are grouped into a set of hierarchical regions, which are the basic units of space reclamation. This approach requires developers to explicitly define the mapping from objects to regions, as well as the hierarchical structure of the regions.
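A minimal bump-pointer region sketch (the API is illustrative, not any particular RBMM system's): objects become offsets into a region buffer, and reclaiming the region is a single pointer reset rather than per-object tracing.

```java
import java.nio.ByteBuffer;

public class Region {
    private final ByteBuffer buf;
    private int top = 0;

    public Region(int capacityBytes) {
        buf = ByteBuffer.allocate(capacityBytes);
    }

    // Bump-pointer allocation: reserve `size` bytes and return their offset.
    public int alloc(int size) {
        if (top + size > buf.capacity()) throw new OutOfMemoryError("region full");
        int off = top;
        top += size;
        return off;
    }

    public void putInt(int off, int v) { buf.putInt(off, v); }
    public int getInt(int off)         { return buf.getInt(off); }

    // Reclaim every object in the region at once -- no tracing, no marking.
    public void reset() { top = 0; }

    public static void main(String[] args) {
        Region r = new Region(1024);
        int a = r.alloc(4);
        r.putInt(a, 42);
        System.out.println(r.getInt(a)); // prints 42
        r.reset(); // the whole region is freed in O(1)
    }
}
```

The burden the text mentions is visible here: the developer, not the runtime, must decide which region each object goes into and when the region as a whole is dead.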


4. Domain-Specific Systems: Some domain-specific data-parallel systems exploit their restricted computation structure to realize more sophisticated memory management. Spark SQL, for example, transforms relational tables into serialized bytes in a main-memory columnar store. However, unlike general-purpose programming models, SQL-like APIs make it difficult to implement advanced iterative applications such as machine learning and graph mining algorithms.


We propose a principled effort to investigate a novel automatic memory management approach for highly efficient distributed data processing. With the new approach, application developers need not migrate from general-purpose data-parallel programming models to less expressive domain-specific languages/APIs. Nor does it require users to manually allocate and reclaim space for in-memory data objects in native languages such as C/C++.


Designing a user-transparent memory management scheme that fundamentally solves the problem of high GC overhead while preserving the generality and expressiveness of the programming model is inherently difficult and poses several challenges, from which the following research topics arise:


Related Research Topics

1. Configuration Auto-Tuning: To achieve optimal memory management performance, end users must determine a large number of performance-critical configuration parameters. The key challenge is the high dimensionality of the configuration space.
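One hedged way to attack the high-dimensional tuning problem is search-based auto-tuning. The sketch below runs a random search over a hypothetical two-parameter space (heap size and page size, both invented for illustration) against a mock cost function standing in for a measured job runtime:

```java
import java.util.Random;

public class AutoTune {
    // Mock cost function standing in for a measured job runtime; a real tuner
    // would run the job with the candidate configuration and time it.
    static double cost(int heapMb, int pageKb) {
        return Math.abs(heapMb - 4096) + Math.abs(pageKb - 64);
    }

    // Random search: sample configurations uniformly, keep the best seen.
    // More sophisticated tuners would use Bayesian optimization or similar.
    public static int[] search(int trials, long seed) {
        Random rnd = new Random(seed);
        int[] best = null;
        double bestCost = Double.MAX_VALUE;
        for (int i = 0; i < trials; i++) {
            int heapMb = 512 + rnd.nextInt(8192);
            int pageKb = 4 + rnd.nextInt(256);
            double c = cost(heapMb, pageKb);
            if (c < bestCost) {
                bestCost = c;
                best = new int[]{heapMb, pageKb};
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] best = search(10_000, 42L);
        System.out.println("best config: heap=" + best[0] + "MB, page=" + best[1] + "KB");
    }
}
```

The curse of dimensionality is already visible at two parameters: each added parameter multiplies the space random search must cover, which motivates smarter model-based tuning.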


2. Compact In-Memory Object Representation: Object headers and object references consume large amounts of space, leading to a low packing factor in memory.
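To make the overhead concrete: on a typical 64-bit HotSpot JVM, a boxed `Integer` occupies around 16 bytes plus a reference per slot, against 4 bytes of actual payload. The sketch below packs fixed-width records into a single `ByteBuffer`, eliminating per-record headers and references (the record layout is invented for illustration):

```java
import java.nio.ByteBuffer;

public class PackedRecords {
    // Fixed-width record (id:int, score:double) = 12 bytes of pure payload,
    // versus roughly 32+ bytes for an equivalent two-field heap object
    // once the header, padding, and the reference to it are counted.
    static final int RECORD_BYTES = Integer.BYTES + Double.BYTES;
    private final ByteBuffer buf;

    public PackedRecords(int capacity) {
        buf = ByteBuffer.allocate(capacity * RECORD_BYTES);
    }

    public void put(int i, int id, double score) {
        buf.putInt(i * RECORD_BYTES, id);
        buf.putDouble(i * RECORD_BYTES + Integer.BYTES, score);
    }

    public int id(int i)       { return buf.getInt(i * RECORD_BYTES); }
    public double score(int i) { return buf.getDouble(i * RECORD_BYTES + Integer.BYTES); }

    public static void main(String[] args) {
        PackedRecords recs = new PackedRecords(10);
        recs.put(3, 7, 0.25);
        System.out.println(recs.id(3) + " " + recs.score(3)); // prints 7 0.25
    }
}
```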


3. Lifetime-Based Memory Management: A typical big data processing task performs similar computations repeatedly, and groups of related data items often exhibit similar lifetime behavior. It is promising to design techniques that automatically group the byte sequences of data items with the same lifetime into a few byte arrays (memory pages), thereby simplifying space reclamation.
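A hedged sketch of the lifetime-grouping idea: items tagged with the same lifetime label (here a made-up stage identifier; a real system would infer lifetimes automatically) are placed in pages dedicated to that lifetime, so when the lifetime ends all of its pages are dropped in one step. Each `byte[]` stands in for a page here for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LifetimePages {
    // Pages grouped by lifetime label (e.g. "stage-1"): items that die
    // together are stored together.
    private final Map<String, List<byte[]>> pages = new HashMap<>();

    public void write(String lifetime, byte[] item) {
        pages.computeIfAbsent(lifetime, k -> new ArrayList<>()).add(item);
    }

    // When a lifetime ends, reclaim all of its pages at once -- no per-object
    // tracing is needed to discover which objects are dead.
    public int reclaim(String lifetime) {
        List<byte[]> dropped = pages.remove(lifetime);
        return dropped == null ? 0 : dropped.size();
    }

    public int livePages() {
        return pages.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        LifetimePages mgr = new LifetimePages();
        mgr.write("stage-1", new byte[]{1});
        mgr.write("stage-1", new byte[]{2});
        mgr.write("stage-2", new byte[]{3});
        System.out.println(mgr.reclaim("stage-1") + " pages freed, "
                + mgr.livePages() + " still live");
    }
}
```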