Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance.

arXiv: Distributed, Parallel, and Cluster Computing (2016)

1. Summary

Recent studies estimate that server cost accounts for as much as 57% of the total cost of ownership (TCO) of a datacenter [1]. One key contributor to this high server cost is the procurement of memory devices such as DRAMs, especially for data-intensive datacenter cloud applications that need low latency (such as web search, in-memory caching, and graph traversal). Such memory devices, however, may be prone to hardware errors caused by unintended bit flips during device operation [40, 33, 41, 20]. To protect against such errors, traditional systems uniformly employ devices with high-quality chips and error correction techniques, both of which increase device cost.

At the same time, we observe that 1) data-intensive applications exhibit a diverse spectrum of tolerance to memory errors, and 2) traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. Our DSN-44 paper [30] is the first to 1) understand how tolerant different data-intensive applications are to memory errors and 2) design a new memory system organization that matches hardware reliability to application tolerance in order to reduce system cost. The main idea of our approach is to classify applications based on their memory error tolerance, and to map applications to heterogeneous-reliability memory system designs, managed cooperatively between hardware and software, to reduce system cost.

Our DSN-44 paper provides the following contributions:

1. A new methodology to quantify the tolerance of applications to memory errors. Our approach measures the effect of memory errors on application correctness and quantifies an application's ability to mask or recover from memory errors.

2. A comprehensive characterization of the memory error tolerance of three data-intensive workloads: an interactive web search application [30, 39], an in-memory key-value store [30, 3], and a graph mining framework [30, 29]. We find that there exists an order of magnitude difference in memory error tolerance across these three applications.

3. An exploration of the design space of new memory system organizations, called heterogeneous-reliability memory, which combine a heterogeneous mix of reliability techniques that leverage application error tolerance to reduce system cost. We show that our techniques can reduce server hardware cost by 4.7%, while achieving 99.90% single-server availability.
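The main idea summarized above, classifying data by its memory error tolerance and mapping it to memory of matching reliability, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tier names, the relative costs, the tolerance thresholds, and the notion of a tolerance score in [0, 1] are hypothetical placeholders, not values or interfaces from the paper.

```python
from dataclasses import dataclass

# Hypothetical tiers: names, relative costs, and thresholds are illustrative
# placeholders and are NOT taken from the paper.


@dataclass(frozen=True)
class MemoryTier:
    name: str
    relative_cost: float  # cost per GB relative to the most protected tier
    min_tolerance: float  # lowest tolerance score this tier is allowed to serve


# Ordered from cheapest (least protected) to most expensive (most protected).
TIERS = [
    MemoryTier("no-protection", 0.80, min_tolerance=0.9),
    MemoryTier("parity-plus-software-recovery", 0.90, min_tolerance=0.5),
    MemoryTier("ecc", 1.00, min_tolerance=0.0),
]


def choose_tier(tolerance: float) -> MemoryTier:
    """Return the cheapest tier whose tolerance threshold the score meets."""
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must be in [0, 1]")
    for tier in TIERS:
        if tolerance >= tier.min_tolerance:
            return tier
    raise AssertionError("unreachable: the last tier accepts any tolerance")


def relative_memory_cost(regions: dict[str, tuple[float, float]]) -> float:
    """Cost of a heterogeneous layout relative to placing everything in 'ecc'.

    `regions` maps a region name to (size_gb, tolerance_score).
    """
    total_gb = sum(size for size, _ in regions.values())
    cost = sum(size * choose_tier(tol).relative_cost for size, tol in regions.values())
    return cost / total_gb
```

With these made-up numbers, a workload whose data is split evenly between a highly tolerant region and an intolerant one would pay the cheap rate for half of its memory and the full rate for the rest. The design choice shown is the one the summary describes: the most expensive protection is reserved for data that cannot tolerate errors, rather than applied uniformly.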