XORing Elephants: Novel Erasure Codes for Big Data
PVLDB 6(5): 325-336, 2013
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high stor...
- Facebook and many others are transitioning to erasure coding techniques to introduce redundancy while saving storage [4, 19], especially for data that is more archival in nature.
- Using the parameters of Facebook clusters, the data blocks of each large file are grouped in stripes of 10 and for each such set, 4 parity blocks are created
- This system (called RS (10, 4)) can tolerate any 4 block failures and has a storage overhead of only 40%.
- RS codes are significantly more robust and storage-efficient compared to replication
- This storage overhead is the minimal possible for this level of reliability.
- Codes that achieve this optimal storage-reliability tradeoff are called Maximum Distance Separable (MDS), and Reed-Solomon codes form the most widely used MDS family
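The storage-reliability arithmetic above can be checked directly. A toy illustration using the stated RS(10, 4) parameters; the `overhead` helper is ours, not from the paper:

```python
# Illustrative comparison: storage overhead and fault tolerance of
# 3-way replication vs. an RS(10, 4) code, per the parameters above.

def overhead(stored_blocks, data_blocks):
    """Extra storage as a fraction of the original data."""
    return (stored_blocks - data_blocks) / data_blocks

# 3-way replication: each of the 10 data blocks is stored 3 times.
rep_stored = 10 * 3
# RS(10, 4): 10 data blocks plus 4 parity blocks per stripe.
rs_stored = 10 + 4

print(overhead(rep_stored, 10))  # 2.0 -> 200% overhead, tolerates any 2 failures
print(overhead(rs_stored, 10))   # 0.4 -> 40% overhead, tolerates any 4 failures
```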
- We introduce new erasure codes that address the main challenges of distributed data reliability and information theoretic bounds that show the optimality of our construction
- Our Contributions: We introduce a new family of erasure codes called Locally Repairable Codes (LRCs) that are efficiently repairable both in terms of network bandwidth and disk I/O
- We analytically show that our codes are information theoretically optimal in terms of their locality, i.e., the number of other blocks needed to repair single block failures. We present both randomized and explicit Locally Repairable Codes constructions starting from generalized Reed-Solomon parities
- We prove that Locally Repairable Codes have the optimal distance for that specific locality, due to an information theoretic tradeoff that we establish
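The locality-distance tradeoff referenced above is commonly stated as a Singleton-like bound (this is the bound of Gopalan et al., cited in the references; notation assumed here: an (n, k) code with locality r and minimum distance d):

```latex
% Distance-locality tradeoff for an (n, k) code whose symbols each
% have locality r, i.e., are repairable from r other symbols:
d \le n - k - \left\lceil \frac{k}{r} \right\rceil + 2
% Setting r = k recovers the classical Singleton bound d \le n - k + 1.
```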
- We observed 2× disk I/O and network reduction for the cost of 14% more storage, a price that seems reasonable for many scenarios
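A rough sketch of where those two numbers come from, assuming the HDFS-Xorbas parameters (10 data blocks, 4 RS parity blocks, plus 2 local parity blocks, with single-block repairs served by a local group of 5 blocks); the back-of-the-envelope arithmetic is ours:

```python
# Back-of-the-envelope check of the "2x I/O reduction for 14% more
# storage" claim, under the assumed Xorbas stripe layout.

rs_blocks  = 10 + 4          # RS(10, 4) stripe size
lrc_blocks = 10 + 4 + 2      # LRC stripe: two extra local parity blocks

extra_storage = lrc_blocks / rs_blocks - 1
print(round(extra_storage * 100))          # ~14 (% more storage than RS)

rs_repair_reads  = 10        # RS single-block repair reads k = 10 blocks
lrc_repair_reads = 5         # LRC repairs within a local group of 5 blocks
print(rs_repair_reads / lrc_repair_reads)  # 2.0 (disk I/O / network reduction)
```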
- We introduced a new family of codes called Locally Repairable Codes (LRCs) that have marginally suboptimal storage but significantly smaller repair disk I/O and network bandwidth requirements
- The authors detail a series of experiments performed to evaluate the performance of HDFS-Xorbas in two environments: Amazon's Elastic Compute Cloud (EC2) and a test cluster at Facebook.
- HDFS Bytes Read corresponds to the total amount of data read by the jobs initiated for repair.
- It is obtained by aggregating partial measurements collected from the statistics-reports of the jobs spawned following a failure event.
- Since the cluster does not handle any external traffic, Network Traffic is equal to the amount of data moving into nodes
- It is measured using Amazon’s AWS Cloudwatch monitoring tools.
- Repair Duration is calculated as the time interval between the starting time of the first repair job and the ending time of the last repair job
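The three metrics above reduce to simple aggregations over per-job repair statistics. A minimal sketch; the record layout is illustrative, not the HDFS-Xorbas reporting format:

```python
# Illustrative aggregation of repair metrics from per-job statistics.
# Each record is (start_time, end_time, hdfs_bytes_read); the field
# layout is hypothetical, mirroring the definitions in the text.

jobs = [
    (0.0, 40.0, 512 * 2**20),   # repair job 1
    (5.0, 55.0, 768 * 2**20),   # repair job 2
]

# HDFS Bytes Read: total data read by the repair jobs.
hdfs_bytes_read = sum(b for _, _, b in jobs)
# Repair Duration: first job's start time to last job's end time.
repair_duration = max(e for _, e, _ in jobs) - min(s for s, _, _ in jobs)

print(hdfs_bytes_read // 2**20)  # total MB read
print(repair_duration)           # 55.0
```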
- Modern storage systems are transitioning to erasure coding.
- One related area where the authors believe locally repairable codes can have a significant impact is purely archival clusters.
- In this case the authors can deploy large LRCs that can simultaneously offer high fault tolerance and small storage overhead.
- This would be impractical if Reed-Solomon codes were used, since the repair traffic grows linearly in the stripe size.
- Local repairs would further allow spinning disks down, since very few disks are required for single block repairs.
- Table 1: Comparison summary of the three schemes. MTTDL assumes independent node failures
- Table 2: Repair impact on workload
- Table 3: Experiment on Facebook's Cluster Results
- Optimizing code designs for efficient repair is a topic that has recently attracted significant attention due to its relevance to distributed systems. There is a substantial volume of work and we only give a high-level overview here; the interested reader can refer to the references below and the works cited therein.
The first important distinction in the literature is between functional and exact repair. Functional repair means that when a block is lost, a different block is created that maintains the (n, k) fault tolerance of the code. The main problem with functional repair is that when a systematic block is lost, it may be replaced with a parity block. While global fault tolerance to n − k erasures remains, reading a single block would now require access to k blocks. While this could be useful for archival systems with rare reads, it is not practical for our workloads. Therefore, we are interested only in codes with exact repair, so that we can keep the code systematic.
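Exact repair can be illustrated with the simplest possible local code, a single XOR parity over a group of blocks. This is a toy sketch, not the paper's construction: the lost block is rebuilt byte-for-byte, so the stripe stays systematic after repair.

```python
# Toy exact repair: a local group of data blocks protected by one XOR
# parity. Losing any single block, the exact same bytes are rebuilt by
# XORing the survivors, so the code remains systematic after repair.

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

group = [b"abcd", b"efgh", b"ijkl"]   # systematic data blocks
parity = xor_blocks(group)            # local parity block

lost = group[1]                       # pretend block 1 is lost
survivors = [group[0], group[2], parity]
repaired = xor_blocks(survivors)
assert repaired == lost               # exact repair: identical bytes
print(repaired)  # b'efgh'
```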
-  V. Cadambe, S. Jafar, H. Maleki, K. Ramchandran, and C. Suh. Asymptotic interference alignment for optimal repair of mds codes in distributed storage. Submitted to IEEE Transactions on Information Theory, Sep. 2011 (consolidated paper of arXiv:1004.4299 and arXiv:1004.4663).
-  B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, et al. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 143–157, 2011.
-  M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with orchestra. In SIGCOMM-Computer Communication Review, pages 98–109, 2011.
-  A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Transactions on Information Theory, pages 4539–4551, 2010.
-  A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A survey on network codes for distributed storage. Proceedings of the IEEE, 99(3):476–489, 2011.
-  B. Fan, W. Tantisiriroj, L. Xiao, and G. Gibson. Diskreduce: Raid for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 6–10. ACM, 2009.
-  D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX conference on Operating systems design and implementation, pages 1–7, 2010.
-  P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin. On the locality of codeword symbols. CoRR, abs/1106.3625, 2011.
-  K. Greenan, J. Plank, and J. Wylie. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. In HotStorage, 2010.
-  A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. Computer Communications Review (CCR), pages 68–73, 2009.
-  A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39:51–62, Aug. 2009.
-  C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: a scalable and fault-tolerant network structure for data centers. SIGCOMM Comput. Commun. Rev., 38:75–86, August 2008.
-  T. Ho, M. Medard, R. Koetter, D. Karger, M. Effros, J. Shi, and B. Leong. A random linear network coding approach to multicast. IEEE Transactions on Information Theory, pages 4413–4430, October 2006.
-  C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. NCA, 2007.
-  S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. Tolhuizen. Polynomial time algorithms for multicast network code construction. Information Theory, IEEE Transactions on, 51(6):1973–1982, 2005.
-  O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In FAST 2012.
-  O. Khan, R. Burns, J. S. Plank, and C. Huang. In search of I/O-optimal recovery from disk failures. In HotStorage ’11: 3rd Workshop on Hot Topics in Storage and File Systems, Portland, June 2011. USENIX.
-  D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. ACM Transactions on Storage (TOS), 4(3):10, 2008.
-  F. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In INFOCOM, 2011 Proceedings IEEE, pages 1215–1223, April 2011.
-  D. Papailiopoulos and A. G. Dimakis. Locally repairable codes. In ISIT 2012.
-  D. Papailiopoulos, J. Luo, A. Dimakis, C. Huang, and J. Li. Simple regenerating codes: Network coding for cloud storage. Arxiv preprint arXiv:1109.0264, 2011.
-  K. Rashmi, N. Shah, and P. Kumar. Optimal exact-regenerating codes for distributed storage at the msr and mbr points via a product-matrix construction. Information Theory, IEEE Transactions on, 57(8):5227–5239, 2011.
-  I. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the SIAM, 1960.
-  R. Rodrigues and B. Liskov. High availability in dhts: Erasure coding vs. replication. Peer-to-Peer Systems IV, pages 226–239, 2005.
-  M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing elephants: Novel erasure codes for big data. USC Technical Report 2012, available online at http://bit.ly/xorbas.
-  N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Interference alignment in regenerating codes for distributed storage: Necessity and code constructions. Information Theory, IEEE Transactions on, 58(4):2134–2158, 2012.
-  I. Tamo, Z. Wang, and J. Bruck. MDS array codes with optimal rebuilding. CoRR, abs/1103.3737, 2011.
-  S. B. Wicker and V. K. Bhargava. Reed-Solomon codes and their applications. IEEE Press, 1994.
-  Q. Xin, E. Miller, T. Schwarz, D. Long, S. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In MSST, pages 146–156. IEEE, 2003.