Theme Title: Towards Efficient and Scalable Online Tracing for Datacenter Workloads


Technical Area: Resource Management



Workload tracing is a fundamental technology that enables a variety of important applications, including debugging, performance optimization, trace-driven simulation, resource demand analysis and beyond. It is also an indispensable means of accessing and understanding internal workload characteristics when the workload is proprietary or its runtime environment cannot be reproduced.


However, the tracing tools and methods proposed over the past decades are challenged by the recent paradigm shift of consolidating workloads in datacenters. Unlike standalone offline workloads, datacenter workloads typically appear as inter-dependent microservices running in a large cluster, in a virtualized environment that is difficult to replicate offline. This nature of datacenter workloads places demanding new requirements on tracing technologies that did not previously exist. For example, while Software Development Emulator (SDE) is able to generate replayable instruction traces, it incurs significant tracing overhead and usually causes a 10x to 100x slowdown for the application being traced. Such a slowdown is unacceptable when tracing datacenter workloads online, since it may ripple through the dependent workloads and cause the system to behave abnormally. In contrast, the hardware-assisted Processor Trace technology can perform lightweight instruction tracing by capturing dynamic branch information. Nevertheless, the captured trace cannot be replayed, because it contains only the instruction flow and lacks the data trace, which leaves it of little value beyond debugging. We therefore face a dire situation: more and more workloads are consolidated in datacenters, while the available tracing technologies are outdated and unable to capture quality traces of those workloads for offline analysis. If left unaddressed, this may deepen our misunderstanding of the behavior of datacenter workloads and cause us to miss optimization opportunities across the system stack.



The goal of this research is to explore, architect and design a new and comprehensive online tracing infrastructure that meets the stringent requirements of datacenter workloads and paves the way for efficient, replayable and scalable workload tracing. We invite researchers who are experts in, or keenly aware of, the challenges and opportunities in this field to contribute to this foundational technology, upon which many important application scenarios crucially hinge.


Related Research Topics

Specifically, for the online tracing infrastructure to suit the properties of datacenter workloads, it should address the following challenges and/or topics:


1. Efficiency

Since online workloads are typically sensitive to latency and response time, it is very important for the infrastructure to keep the tracing overhead as low as possible; otherwise the behavior of the workload may no longer be representative.
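One way to make the overhead requirement above concrete is to compare tail latency with and without tracing enabled and check the relative increase against a budget. The following is only a minimal sketch under stated assumptions: the function names, the choice of the 99th percentile, and the 5% budget are all hypothetical and are not prescribed by this theme.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def tracing_overhead(baseline_ms, traced_ms, pct=99):
    """Relative increase in the chosen latency percentile caused by tracing."""
    base = percentile(baseline_ms, pct)
    traced = percentile(traced_ms, pct)
    return (traced - base) / base

def within_budget(baseline_ms, traced_ms, budget=0.05, pct=99):
    """True if tracing stays within the (assumed) overhead budget."""
    return tracing_overhead(baseline_ms, traced_ms, pct) <= budget
```

With these definitions, a tool that slows requests from 10 ms to 100 ms has a relative overhead of 9.0 and fails any small budget, whereas a tool that adds 0.2 ms to the same requests has an overhead of about 0.02.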


2. Replayability

The collected trace should be replayable offline and reproduce the same results as the online run, which allows the trace to lend itself to various use cases and maximizes its value. This also means that both instruction and data traces need to be captured without incurring significant overhead, which could be challenging.


3. Deployability

Since datacenter workloads are typically scheduled on or migrated among servers, it is important for the tracing infrastructure to be able to detect the start of a given workload, trigger tracing after a programmed time lapse, and aggregate the collected traces on designated servers. Combined, these capabilities represent the deployability of the infrastructure, that is, its ability to adapt to the workload dynamics in the datacenter.
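The three capabilities described above (detecting workload start, triggering after a time lapse, and aggregating traces) can be sketched as a per-host agent that reports to a central collector. This is an illustrative sketch only: `TraceAgent`, `Collector`, `TraceSegment` and all of their fields are hypothetical names, time is passed in explicitly rather than read from a clock, and the actual trace capture is left out.

```python
from dataclasses import dataclass, field

@dataclass
class TraceSegment:
    workload_id: str
    host: str
    start: float
    end: float

@dataclass
class Collector:
    """Stands in for the designated server that aggregates traces."""
    segments: dict = field(default_factory=dict)

    def submit(self, segment):
        self.segments.setdefault(segment.workload_id, []).append(segment)

@dataclass
class TraceAgent:
    """Hypothetical per-host agent; a real one would drive a tracing backend."""
    host: str
    delay_s: float      # programmed time lapse after workload start
    duration_s: float   # how long each trace segment should cover
    collector: Collector
    _started: dict = field(default_factory=dict)
    _traced: set = field(default_factory=set)

    def on_workload_start(self, workload_id, now):
        # Called when the workload is scheduled on, or migrated to, this host.
        self._started[workload_id] = now

    def on_workload_stop(self, workload_id):
        # Called when the workload exits or migrates away from this host.
        self._started.pop(workload_id, None)

    def tick(self, now):
        # Trigger tracing once per workload after the programmed delay.
        for wid, started in self._started.items():
            if wid not in self._traced and now - started >= self.delay_s:
                self._traced.add(wid)
                self.collector.submit(
                    TraceSegment(wid, self.host, now, now + self.duration_s))
```

In this sketch, a workload that migrates before the delay has elapsed is simply picked up by the agent on the destination host, and the collector groups segments by workload regardless of which host produced them.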


4. Virtualization Compatibility

Since a majority of the workloads in modern datacenters run in virtual machines, containers or JVMs, the tracing infrastructure should be able to perform tracing regardless of the virtualization technology employed by the workloads.