Title: Data Layout Research for a Low-Cost Cold Storage System

 

Technical Area: Storage

 

Background

Much of the data stored in cloud storage is rarely accessed; such data is called cold data. The volume of cold data keeps growing, and how to store it has become a significant problem for public cloud services. In traditional storage, DataDomain proposed an innovative method that removes redundant data and makes disk a viable backup medium, reducing the cost of cold data backup. Data de-duplication has since become a standard feature of disk backup systems. Besides data reduction, choosing a low-cost storage medium is another effective way to reduce cost. Tape is a common low-cost medium for archive systems. However, once backup/restore performance, storage space utilization and system maintenance cost are taken into account, tape is not the cheapest medium for cold data. In cloud environments, log-structured design is already widely used to simplify distributed system design and improve system reliability. Because a log-structured design uses append-only writes, garbage collection (GC) is needed to reclaim unused space on the storage medium. GC is not feasible on tape because tape cannot be accessed randomly. As a result, storage space utilization on tape is lower than on disk, which significantly affects total cost of ownership (TCO). In general, with data de-duplication, compression and high-density storage technologies, disk is also a suitable medium for building a low-cost cold storage system for cloud computing.
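
The link between append-only writes, GC and space utilization can be illustrated with a minimal sketch. The `Segment` class, the `garbage_collect` function and the record sizes below are hypothetical and are not part of any existing system; they only show why utilization stays low when dead records cannot be reclaimed.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical append-only log segment holding key/size/live records."""
    records: list = field(default_factory=list)

    def append(self, key, size):
        # Nothing is updated in place: an older version of the key is only
        # marked dead, and the new version is appended at the tail.
        for rec in self.records:
            if rec["key"] == key and rec["live"]:
                rec["live"] = False
        self.records.append({"key": key, "size": size, "live": True})

    def used_bytes(self):
        # Space occupied on the medium, including dead records.
        return sum(rec["size"] for rec in self.records)

    def live_bytes(self):
        # Space occupied by records that are still valid.
        return sum(rec["size"] for rec in self.records if rec["live"])

    def utilization(self):
        used = self.used_bytes()
        return self.live_bytes() / used if used else 1.0


def garbage_collect(segment):
    """Copy live records into a fresh segment so the old one can be freed.

    Selecting only the live records relies on random access to the old
    segment, which is the property the text says tape lacks.
    """
    fresh = Segment()
    for rec in segment.records:
        if rec["live"]:
            fresh.records.append(dict(rec))
    return fresh


# Illustrative workload: key "a" is written twice, so its first copy is dead.
seg = Segment()
seg.append("a", 100)
seg.append("b", 100)
seg.append("a", 100)
compacted = garbage_collect(seg)
print(seg.utilization(), compacted.utilization())
```

Before GC, one third of the occupied space holds a dead record; after GC, the compacted segment holds only live data. A medium on which this compaction cannot run keeps paying for the dead space.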

 

Power consumption is a major issue when high-density disks are used to build a cold storage system. If a rack holds 8 high-density JBODs, more than 1000 disks can be installed. In general, the power supply inside a rack cannot deliver enough power to keep all of these disks active at once. This is the basic reason why disk power consumption must be reduced. In a large-scale data center, storage power consumption is also a major part of the cost, which is the other main reason to control power consumption in a low-cost storage system. The defining characteristic of cold data is that it is rarely accessed. If the data layout on disk is reasonable, most disks can stay offline while a small set of active disks serves read/write requests. Following this idea, one rack can hold more than a thousand disks, and active power consumption can still be controlled to reduce total cost.
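
The power budget argument can be made concrete with a short sketch. All numbers below (rack budget, per-disk active and standby power) are hypothetical values chosen for illustration, not measurements, and the functions `max_active_disks` and `plan_groups` are assumed names. The sketch assumes a simple model in which each disk is either active or in standby, and it groups disks so that only one group needs to be active at a time.

```python
def max_active_disks(total_disks, rack_budget_w, active_w, standby_w):
    """Largest number of disks that can be active within the rack budget.

    Assumed power model: n active disks draw active_w watts each and the
    remaining (total_disks - n) disks draw standby_w watts each.
    """
    if active_w <= standby_w:
        raise ValueError("active power must exceed standby power")
    baseline = total_disks * standby_w  # power drawn with every disk in standby
    if baseline > rack_budget_w:
        return 0
    headroom = rack_budget_w - baseline
    n = int(headroom // (active_w - standby_w))
    return min(n, total_disks)


def plan_groups(total_disks, group_size):
    """Split disk indices into groups; only one group is active at a time."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [
        range(start, min(start + group_size, total_disks))
        for start in range(0, total_disks, group_size)
    ]


# Hypothetical rack: 1000 disks, 3000 W budget, 8 W active, 1 W standby.
active_limit = max_active_disks(1000, 3000, 8, 1)
groups = plan_groups(1000, active_limit)
print(active_limit, len(groups))
```

Under these assumed numbers only a fraction of the disks can be active together, so a layout that directs current writes to one active group and leaves the other groups offline keeps the rack within its budget.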

 

Target

The target of this project is to design a cost-effective cold storage system for large-scale cloud computing environments, and to propose innovative approaches to the challenges encountered. A power-consumption-oriented data layout, a scheduling method and a data de-duplication mechanism should be designed to balance cost and performance. With this low-cost cold storage system deployed in a cloud environment, the storage cost is expected to fall below 1 cent per GB per month.

 

Related Research Topics

Traditional cold storage systems focus only on data reduction and do not consider large-scale application environments or power consumption. In a cloud storage environment, a large-scale storage system consumes a great deal of power, which becomes a major issue affecting system design and implementation. To meet the requirements of scalable cloud storage, this project needs to study data layout under power consumption limits. Some potential research topics are listed below:

 

The expected results of this project are ideas that address the large-scale cold storage system issues above, and some high-quality publications. To verify the proposed ideas, a proof of concept (POC) should be built.