Apache Spark: a unified engine for big data processing

Communications of The ACM, pp. 56-65, 2016.

Cited by: 1377|Bibtex|Views424|DOI:https://doi.org/10.1145/2934664
EI
Other Links: dblp.uni-trier.de|dl.acm.org|academic.microsoft.com
Weibo:
All Apache Spark libraries described in this article are open source at http:// spark.apache.org/

Abstract:

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

Code:

Data:

0
Introduction
  • THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges.
  • Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called “Resilient Distributed Datasets,” or RDDs.[25] Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing[2,26,6]
  • These implementations use the same optimizations as specialized engines and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose.
Highlights
  • THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges
  • In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing
  • Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing[2,26,6]
  • Given that these libraries run over the same engine, do they lose performance? We found that by implementing the optimizations we just outlined within RDDs, we can often match the performance of specialized engines
  • We hope Apache Spark highlights the importance of composability in programming libraries for big data and encourages development of more interoperable libraries
  • All Apache Spark libraries described in this article are open source at http:// spark.apache.org/
Conclusion
  • Scalable data processing will be essential for the generation of computer applications but typically involves a complex sequence of processing steps with different computing systems.
  • To simplify this task, the Spark project introduced a unified programming model and engine for big data applications.
  • All Apache Spark libraries described in this article are open source at http:// spark.apache.org/.
  • Databricks has made videos of all Spark Summit conference talks available for free at https://spark-summit.org/
Summary
  • Introduction:

    THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges.
  • Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called “Resilient Distributed Datasets,” or RDDs.[25] Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing[2,26,6]
  • These implementations use the same optimizations as specialized engines and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose.
  • Conclusion:

    Scalable data processing will be essential for the generation of computer applications but typically involves a complex sequence of processing steps with different computing systems.
  • To simplify this task, the Spark project introduced a unified programming model and engine for big data applications.
  • All Apache Spark libraries described in this article are open source at http:// spark.apache.org/.
  • Databricks has made videos of all Spark Summit conference talks available for free at https://spark-summit.org/
Funding
  • Berkeley’s research on Spark was supported in part by National Science Foundation CISE Expeditions Award CCF-1139158, Lawrence Berkeley National Laboratory Award 7076018, and DARPA XData Award FA875012-2-0331, and gifts from Amazon Web Services, Google, SAP, IBM, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Arimo, Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware
Reference
  • Apache Storm project; http://storm.apache.org 2. Armbrust, M. et al. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
    Locate open access versionFindings
  • 3. Dave, A. Indexedrdd project; http://github.com/ Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth OSDI Symposium on Operating Systems Design and Implementation (San Francisco, CA, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
    Locate open access versionFindings
  • 5. Freeman, J., Vladimirov, N., Kawashima, T., Mu, Y., Sofroniew, N.J., Bennett, D.V., Rosen, J., Yang, C.-T., Looger, L.L., and Ahrens, M.B. Mapping brain activity at scale with cluster computing. Nature Methods 11, 9 (Sept. 2014), 941–950.
    Google ScholarLocate open access versionFindings
  • 6. Gonzalez, J.E. et al. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th OSDI Symposium on Operating Systems Design and Implementation (Broomfield, CO, Oct. 6–8). USENIX Association, Berkeley, CA, 2014.
    Google ScholarLocate open access versionFindings
  • 7. Isard, M. et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the EuroSys Conference (Lisbon, Portugal, Mar. 21–23). ACM Press, New York, 2007.
    Google ScholarLocate open access versionFindings
  • 8. Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the ACM-SIAM SODA Symposium on Discrete Algorithms (Austin, TX, Jan. 17–19). ACM Press, New York, 2010.
    Google ScholarLocate open access versionFindings
  • 9. Kornacker, M. et al. Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the Seventh Biennial CIDR Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 4–7, 2015).
    Google ScholarLocate open access versionFindings
  • 10. Low, Y. et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the 38th International VLDB Conference on Very Large Databases (Istanbul, Turkey, Aug. 27–31, 2012).
    Google ScholarLocate open access versionFindings
  • 11. Malewicz, G. et al. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD/PODS Conference (Indianapolis, IN, June 6–11). ACM Press, New York, 2010.
    Google ScholarLocate open access versionFindings
  • 12. McSherry, F., Isard, M., and Murray, D.G. Scalability! But at what COST? In Proceedings of the 15th HotOS Workshop on Hot Topics in Operating Systems (Kartause Ittingen, Switzerland, May 18–20). USENIX Association, Berkeley, CA, 2015.
    Google ScholarLocate open access versionFindings
  • 13. Melnik, S. et al. Dremel: Interactive analysis of Webscale datasets. Proceedings of the VLDB Endowment 3 (Sept. 2010), 330–339.
    Google ScholarLocate open access versionFindings
  • 14. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., and Talwalkar, A. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
    Google ScholarLocate open access versionFindings
  • 15. Nothaft, F.A., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M.J., Joseph, A.D., and Patterson, D.A. Rethinking dataintensive science using scalable analytics systems. In Proceedings of the SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
    Google ScholarLocate open access versionFindings
  • 16. Shun, J. and Blelloch, G.E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN PPoPP Symposium on Principles and Practice of Parallel Programming (Shenzhen, China, Feb. 23–27). ACM Press, New York, 2013.
    Google ScholarLocate open access versionFindings
  • 17. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J.E., Franklin, M.J., Jordan, M.I., and Kraska, T. MLI: An API for distributed machine learning. In Proceedings of the IEEE ICDM International Conference on Data Mining (Dallas, TX, Dec. 7–10). IEEE Press, 2013.
    Google ScholarLocate open access versionFindings
  • 18. Stonebraker, M. and Cetintemel, U. ‘One size fits all’: An idea whose time has come and gone. In Proceedings of the 21st International ICDE Conference on Data Engineering (Tokyo, Japan, Apr. 5–8). IEEE Computer Society, Washington, D.C., 2005, 2–11.
    Google ScholarLocate open access versionFindings
  • 19. Thomas, K., Grier, C., Ma, J., Paxson, V., and Song, D. Design and evaluation of a real-time URL spam filtering service. In Proceedings of the IEEE Symposium on Security and Privacy (Oakland, CA, May 22–25). IEEE Press, 2011.
    Google ScholarLocate open access versionFindings
  • 20. Valiant, L.G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
    Google ScholarLocate open access versionFindings
  • 21. Venkataraman, S. et al. SparkR; http://dl.acm.org/citation.cfm?id=2903740&CFID=687410325&CFTO KEN=83630888 22. Xin, R. and Zaharia, M. Lessons from running largescale Spark workloads; http://tinyurl.com/largescale-spark 23.
    Locate open access versionFindings
  • Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., and Stoica, I. Shark: SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD/PODS Conference (New York, June 22–27). ACM Press, New York, 2013.
    Google ScholarLocate open access versionFindings
  • 24. Zaharia, M. An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. thesis, Electrical Engineering and Computer Sciences Department, University of California, Berkeley, 2014; https://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf 25. Zaharia, M. et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the Ninth USENIX NSDI Symposium on Networked Systems Design and Implementation (San Jose, CA, Apr. 25–27, 2012).
    Locate open access versionFindings
  • 26. Zaharia, M. et al. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the 24th ACM SOSP Symposium on Operating Systems Principles (Farmington, PA, Nov. 3–6). ACM Press, New York, 2013.
    Google ScholarLocate open access versionFindings
  • 27. Zhang, Z., Barbary, K., Nothaft, N.A., Sparks, E., Zahn, O., Franklin, M.J., Patterson, D.A., and Perlmutter, S. Scientific Computing Meets Big Data Technology: An Astronomy Use Case. In Proceedings of IEEE International Conference on Big Data (Santa Clara, CA, Oct. 29–Nov. 1). IEEE, 2015. Watch the authors discuss their work in this exclusive Communications video. http://cacm.acm.org/videos/spark
    Locate open access versionFindings
Full Text
Your rating :
0

 

Tags
Comments