A Model of Computation for MapReduce

Symposium on Discrete Algorithms (SODA), 2010, pages 938–948


Abstract

In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations.

Introduction
  • In a world in which large data sets are measured in tera- and petabytes, a new form of parallel computing has emerged as an easy-to-program, reliable, and distributed paradigm to process these massive quantities of data.

    1.1 MapReduce Basics. In the MapReduce programming paradigm, the basic unit of information is a ⟨key; value⟩ pair, where each key and each value is a binary string.
  • The input to any MapReduce algorithm is a set of ⟨key; value⟩ pairs.
  • The mapper μ takes as input a single ⟨key; value⟩ pair and produces as output any number of new ⟨key; value⟩ pairs.
  • It is crucial that the map operation is stateless; that is, it operates on one pair at a time.
  • This allows for easy parallelization, since different inputs to the map can be processed by different machines (a minimal sketch follows this list).
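
To make the mapper/reducer contract concrete, here is a minimal sketch using the classic word-count example. The function names and the sequential in-memory driver are illustrative assumptions, not part of the paper; a real MapReduce system runs the mappers and reducers in parallel across many machines.

```python
from collections import defaultdict

# Word-count sketch of the stateless mapper/reducer contract.
# A mapper sees exactly one <key; value> pair per call and emits new pairs;
# a reducer sees one key together with every value shuffled to that key.

def mapper(key, value):
    """Map one <docid; text> pair to <word; 1> pairs."""
    return [(word, 1) for word in value.split()]

def reducer(key, values):
    """Reduce all counts for one word to a single <word; total> pair."""
    return [(key, sum(values))]

def run_round(pairs):
    """Toy sequential driver for a single map-shuffle-reduce round."""
    shuffled = defaultdict(list)
    for k, v in pairs:
        for k2, v2 in mapper(k, v):  # each input pair is mapped independently
            shuffled[k2].append(v2)  # shuffle: group emitted values by key
    out = []
    for k, vs in shuffled.items():
        out.extend(reducer(k, vs))
    return out

if __name__ == "__main__":
    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
    print(run_round(docs))  # [('the', 2), ('quick', 1), ...] in some order
```

Because the mapper is stateless, the input pairs could be split across any number of machines without changing the result; only the shuffle step requires coordination.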
Highlights
  • We demonstrate how algorithms can take advantage of this fact to compute a Minimum Spanning Tree (MST) of a dense graph in only two rounds, as opposed to the Ω(log n) rounds needed in the standard PRAM model (see the sketch after this list).
  • Since the comparisons are cleaner in the deterministic case, we focus on DMRC here, but there are analogous questions for the randomized class MRC.
  • We begin by describing a basic building block of many MRC algorithms called "MRC-parallelizable functions." We show how a family of such functions can be used as subroutines of MRC computations.
  • We have presented a rigorous computational model for the MapReduce paradigm.
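
A compact sketch of the two-round dense-graph MST idea follows. The partition-and-merge structure mirrors the algorithm described in the paper; the edge representation, the default choice of k, and the sequential loop standing in for parallel reducers are simplifying assumptions.

```python
import random
from itertools import combinations

def kruskal_forest(nodes, edges):
    """Minimum spanning forest via Kruskal's algorithm with union-find."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    forest = []
    for w, u, v in sorted(edges):  # scan edges in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def two_round_mst(nodes, edges, k=4):
    """Two-round MST in the spirit of the paper's dense-graph algorithm.

    Round 1: randomly partition the vertices into k parts; for each pair of
    parts, one reducer keeps only the minimum spanning forest of the induced
    subgraph.  Any edge discarded there closes a cycle of lighter edges, so
    by the cycle property it cannot belong to the global MST.
    Round 2: a single reducer computes the MST of the surviving edges.
    """
    part = {v: random.randrange(k) for v in nodes}
    surviving = set()
    for i, j in combinations(range(k), 2):
        sub_nodes = [v for v in nodes if part[v] in (i, j)]
        sub_edges = [e for e in edges
                     if part[e[1]] in (i, j) and part[e[2]] in (i, j)]
        surviving.update(kruskal_forest(sub_nodes, sub_edges))
    return kruskal_forest(nodes, list(surviving))
```

In the paper, each pair of parts is handled by its own reducer in round one, and the union of the resulting forests has roughly O(kn) edges, small enough for a single machine's memory in round two when the input graph is dense.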
Conclusion
  • The authors pause to justify some of the modeling decisions made above.

    3.1.1 Machines. As the authors argued before, it is unrealistic to assume that there are a linear number of machines available when n is large.
  • More than one instance of a reducer may be run on the same machine. The authors have presented a rigorous computational model for the MapReduce paradigm.
  • By restricting both the total memory per machine and the total number of machines to O(n^{1-ε}), the authors ensure that the programmer must parallelize the computation and that the number of machines used must remain relatively small (the bounds are restated after this list).
  • Rather, the authors require that mappers, as well as reducers, run in polynomial time.
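
For reference, the resource bounds that define the class MRC^i can be restated compactly; this is a paraphrase of the constraints summarized above, with ε > 0 a fixed constant and n the input size.

```latex
% Resource bounds defining $\mathcal{MRC}^i$ (paraphrased from the paper).
% Here $\epsilon > 0$ is a fixed constant and $n$ is the input size.
\begin{align*}
  \text{memory per machine}      &\;=\; O(n^{1-\epsilon}) \\
  \text{number of machines}      &\;=\; O(n^{1-\epsilon}) \\
  \text{time per mapper/reducer} &\;=\; \mathrm{poly}(n) \\
  \text{number of rounds}        &\;=\; O(\log^i n)
\end{align*}
```

Together, the first two bounds cap the total memory in the system at O(n^{2-2ε}), which still comfortably exceeds the input size, while the round bound distinguishes the subclasses MRC^0, MRC^1, and so on.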
Related work
  • We begin by comparing the MapReduce framework with other models of parallel computation. After that we discuss other works that use MapReduce.

    4.1 Comparing MapReduce and PRAMs. Numerous models of parallel computation have been proposed in the literature; see [1] for a survey. While the most popular by far for theoretical study is the PRAM, probably the next two most popular are LogP, proposed by Culler et al. [2], and BSP, proposed by Valiant [12]. These three models are all architecture independent. Other researchers have studied architecture-dependent models, such as the fixed-connection network model described in [10].

    Since the most prevalent model in theoretical computer science is the PRAM, it seems most appropriate to compare our MapReduce model to it. In a PRAM, an arbitrary number of processors, sharing an unboundedly large memory, operate synchronously on a shared input to produce some output. Different variants of PRAMs deal differently with issues of concurrent reading and concurrent writing, but the differences are insignificant from our perspective. One usually assumes that, to solve a problem of size n, the number of processors should be bounded by a polynomial in n, a necessary, but hardly sufficient, condition to ensure efficiency.
References
  • [1] D. K. G. Campbell. A survey of models of parallel computation. Technical report, University of York, March 1997.
  • [2] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 4:1–12, May 1993.
  • [3] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proceedings of WWW, pages 271–280, 2007.
  • [4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137–150, 2004.
  • [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
  • [6] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. In S.-H. Teng, editor, SODA, pages 710–719. SIAM, 2008.
  • [7] R. L. Graham. Bounds on multiprocessing anomalies and related packing algorithms. In AFIPS '71 (Fall): Proceedings of the November 16–18, 1971, fall joint computer conference, pages 205–217, New York, NY, USA, 1971. ACM.
  • [8] Hadoop wiki, "Powered By". http://wiki.apache.org/hadoop/PoweredBy.
  • [9] U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec. HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Technical Report CMU-ML-08-117, CMU, December 2008.
  • [10] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
  • [11] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. DOULION: Counting triangles in massive graphs with a coin. In Knowledge Discovery and Data Mining (KDD), 2009.
  • [12] L. G. Valiant. A bridging model for parallel computation. CACM, 33(8):103–111, August 1990.
  • [13] V. V. Vazirani. Approximation Algorithms. Springer, March 2004.
  • [14] Yahoo! partners with four top universities to advance cloud computing systems and applications research. Yahoo! Press Release, 2009. http://research.yahoo.com/news/2743.