Kineograph: taking the pulse of a fast-changing and connected world

EuroSys, (2012)

Cited: 288|Views165
EI WOS SCOPUS

Abstract

Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms to extract timely insights from the fast-changing graph structure. To accommodate gra...More

Code:

Data:

0
Introduction
  • Popular services such as Twitter, Facebook, and Foursquare represent a significant departure from websearch and web-mining applications that have been driving much of the distributed systems research in the last decade.
  • Information available on those emerging services has two defining characteristics.
  • Information search and retrieval on micro-blogs has started to receive a lot of attention [27]
Highlights
  • Popular services such as Twitter, Facebook, and Foursquare represent a significant departure from websearch and web-mining applications that have been driving much of the distributed systems research in the last decade
  • We have developed three representative applications on Kineograph with real Twitter feeds for experimentation: TunkRank [31] for user ranking, SP [28] for approximate shortest path, and K-exposure [27] for controversial topic detection
  • Our results show that Kineograph produces timely mining results, such that on average the computed results reflected all tweets updated within 2.5 minutes
  • This externally imposed order does not take into account any causal relationship. It reflects neither the physical-time order nor any causal order. We find it sufficient in our case, partly because Kineograph separates graph updates from graph mining—graph updates are usually simple and straightforward
  • We evaluated Kineograph on a cluster with up to 51 machines, each connected with Gigabit Ethernet. 25 of the machines contained an Intel Xeon X3360 CPU and 8GB memory
  • While this paper focuses on an overall architectural design with novel constructs, we expect to see new abstractions and building blocks emerging in the near future
Results
  • The authors have implemented Kineograph using C# with more than 17,000 lines of code, excluding test code.
  • Table 1 summarizes the lines of code for different components in the Kineograph system and its applications.
  • The authors evaluated Kineograph on a cluster with up to 51 machines, each connected with Gigabit Ethernet.
  • 25 of the machines contained an Intel Xeon X3360 CPU and 8GB memory.
  • The remaining had an Intel Xeon X5550 CPU and 12GB memory.
  • All the machines ran the 64-bit version of Windows Server 2008R2 with .NET framework 4.0
Conclusion
  • Kineograph reflects the belief that there is a potential paradigm shift in distributed-system research
  • It departs from the “traditional” areas of high-throughput and scalable batch systems, as represented by systems such as GFS and MapReduce.
  • The new paradigm is inspired by increasingly popular social networking, micro-blogging, and mobile Internet applications.
  • These services are more centered around graph-based storage and computation, while striking a different and delicate balance among timeliness, consistency, and throughput.
  • While this paper focuses on an overall architectural design with novel constructs, the authors expect to see new abstractions and building blocks emerging in the near future
Tables
  • Table1: Line of code count breakdown
  • Table2: The impact of transient imbalance on throughput under K-Exposure, with 8 ingest nodes, 32 graph nodes. SCT Max/Avg: The ratio of maximum over average Snapshot Construction Time
Download tables as Excel
Related work
  • Kineograph builds on a large body of existing literature in distributed systems and database systems. We focus on three most related areas: distributed in-memory storage (key/value) systems, incremental data processing, and graph computation.

    Distributed in-memory storage systems. Distributed inmemory key/value stores have received a lot of attention, both in the research community and in the industry [13, 19, 21]. Kineograph leverages this technology, adds basic graph support, and more importantly supports snapshots.

    Incremental data processing. Recently, many research efforts have focused on improving computation efficiency through augmenting existing scalable batch-processing engines with incremental computation capability. Systems like
Reference
  • R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing Surveys (CSUR), 40(1):1–39, 2008.
    Google ScholarLocate open access versionFindings
  • P. Bhatotia, A. Wieder, R. Rodrigues, U. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. In ACM SoCC, 2011.
    Google ScholarLocate open access versionFindings
  • Y. Bu, B. Howe, M. Balazinska, and M. Ernst. HaLoop: Efficient iterative data processing on large clusters. In VLDB, 2010.
    Google ScholarLocate open access versionFindings
  • M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006.
    Google ScholarLocate open access versionFindings
  • D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams – a new class of data management applications. In VLDB, 2002.
    Google ScholarLocate open access versionFindings
  • S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
    Google ScholarFindings
  • J. Chen, D. J. Dewitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, 2000.
    Google ScholarLocate open access versionFindings
  • T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd. edition, 2001.
    Google ScholarFindings
  • J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1): 107–113, 2008.
    Google ScholarLocate open access versionFindings
  • P. Gunda, L. Ravindranath, C. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic management of data and computation in datacenters. In OSDI, 2010.
    Google ScholarLocate open access versionFindings
  • B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. In ACM SoCC, 2010.
    Google ScholarLocate open access versionFindings
  • P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, 2010.
    Google ScholarLocate open access versionFindings
  • R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: A highperformance, distributed main memory transaction processing system. In VLDB, 2008.
    Google ScholarLocate open access versionFindings
  • U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In IEEE International Conference on Data Mining, 2009.
    Google ScholarLocate open access versionFindings
  • L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
    Google ScholarLocate open access versionFindings
  • D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In ACM SoCC, 2010.
    Google ScholarLocate open access versionFindings
  • Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence(UAI), 2010.
    Google ScholarLocate open access versionFindings
  • G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for largescale graph processing. In SIGMOD, 2010.
    Google ScholarLocate open access versionFindings
  • memcached. Memcached: A distributed memory object caching system, 2011. http://memcached.org. Neo4j: The graph database, 2011.
    Locate open access versionFindings
  • http://neo4j.org. Accessed October, 2011.
    Findings
  • D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in ramcloud. In SOSP, 2011.
    Google ScholarLocate open access versionFindings
  • L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Stanford Technical Report, 1999.
    Google ScholarFindings
  • R. Pearce, M. Gokhale, and N. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010.
    Google ScholarLocate open access versionFindings
  • D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010.
    Google ScholarLocate open access versionFindings
  • L. Popa, M. Budiu, Y. Yu, and M. Isard. Dryadinc: Reusing work in large-scale computations. In HotCloud, 2009.
    Google ScholarLocate open access versionFindings
  • R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI, 2010.
    Google ScholarLocate open access versionFindings
  • D. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on twitter. In WWW, 2011.
    Google ScholarLocate open access versionFindings
  • A. Sarma, S. Gollapudi, M. Najork, and R. Panigrahy. A sketch-based distance oracle for web-scale graphs. In WSDM, 2010. Tweets about steve jobs spike but don’t break twitter peak record, 2011.
    Google ScholarFindings
  • http://searchengineland.com/tweets-about-stevejobs-spike-but-dont-break-twitter-record-96048.
    Findings
  • A. Tanenbaum. Distributed Operating Systems. Prentice Hall, 1995.
    Google ScholarFindings
  • D. Tunkelang. A twitter analog to pagerank. Retrieved from http://thenoisychannel.com/2009/01/13/a-twitter-analog-topagerank, 2009.
    Findings
0
Your rating :

No Ratings

Tags
Comments
数据免责声明
页面数据均来自互联网公开来源、合作出版商和通过AI技术自动分析结果,我们不对页面数据的有效性、准确性、正确性、可靠性、完整性和及时性做出任何承诺和保证。若有疑问,可以通过电子邮件方式联系我们:report@aminer.cn