Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

DSN(2014)

Abstract
Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. For example, we found that while traditional error protection increases memory system cost by 12.5%, some applications can achieve 99.00% availability on a single server with a large number of memory errors without any error protection. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications. Toward this end, in this paper, we make three main contributions to enable highly-reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new data-intensive workloads (an interactive web search application, an in-memory key-value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-offs. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single server availability.
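To put the abstract's figures in concrete terms, the following minimal sketch converts an availability percentage into yearly downtime and computes the storage overhead of a 72-bit ECC word. It is illustrative arithmetic only, not the paper's methodology. The function names are hypothetical, and the link between the 12.5% cost figure and an organization of 8 check bits per 64 data bits is an assumption that the abstract does not state.

```python
# Illustrative arithmetic only; not the methodology of the paper.
# Assumption: a 72-bit ECC word (64 data bits + 8 check bits).

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def ecc_overhead(data_bits: int = 64, check_bits: int = 8) -> float:
    """Extra storage for check bits, as a fraction of the data bits."""
    return check_bits / data_bits


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a server may be unavailable at this availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"ECC storage overhead: {ecc_overhead():.1%}")
    for availability in (99.00, 99.90):
        hours = downtime_hours_per_year(availability)
        print(f"{availability:.2f}% availability -> {hours:.2f} h/year of downtime")
```

Under these assumptions, 99.00% availability allows roughly 87.6 hours of downtime per year on a single server, while 99.90% allows roughly 8.76 hours.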
Keywords
datacenter cost optimization, in-memory key-value store, TCO, graph mining framework, memory error tolerance, computer centres, single server availability, software reliability, memory errors, datacenter total cost of ownership, datacenter cost, memory architectures, soft errors, hard errors, DRAM, data-intensive workloads, software heterogeneous-reliability memory system design, one-size-fits-all memory reliability techniques, internet, interactive web search application, data-intensive applications, application memory error vulnerability, server hardware cost reduction, software fault tolerance, data mining, memory devices, graph theory, error protection, error reduction, cost reduction, hardware heterogeneous-reliability memory system design, memory management, hardware, reliability, servers