Paper of the ML Systems Workshop@NIPS 2017
Deep learning plays an increasingly important role in a wide range of applications. Training a deep learning model consists largely of parallel linear algebra computation, which is well suited to GPUs. However, due to physical constraints, a GPU usually has less device memory than its host has main memory. The latest high-end NVIDIA GPU, the P100, is equipped with 12–16 GB of device memory, while a CPU server has 128 GB of host memory. At the same time, the trend is for deep learning models to adopt a “deeper and wider” architecture. For example, ResNet consists of up to 1001 neuron layers, and a Neural Machine Translation (NMT) model consists of 8 layers with an attention mechanism; most of the layers in an NMT model are sequential ones that unroll horizontally, which brings non-negligible memory consumption.