Kineograph: taking the pulse of a fast-changing and connected world
Kineograph is a distributed system that takes a stream of incoming data and constructs a continuously changing graph, which captures the relationships in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms that extract timely insights from the fast-changing graph structure.
- Popular services such as Twitter, Facebook, and Foursquare represent a significant departure from the web-search and web-mining applications that have driven much of distributed-systems research over the last decade.
- Information available on those emerging services has two defining characteristics.
- Information search and retrieval on micro-blogs has started to receive a lot of attention 
- We have developed three representative applications on Kineograph with real Twitter feeds for experimentation: TunkRank for user ranking, SP for approximate shortest path, and K-exposure for controversial topic detection.
- Our results show that Kineograph produces timely mining results: on average, the computed results reflect all tweet updates within 2.5 minutes.
- This externally imposed order reflects neither the physical-time order nor any causal order. We find it sufficient in our case, partly because Kineograph separates graph updates from graph mining: graph updates are usually simple and straightforward.
- While this paper focuses on an overall architectural design with novel constructs, we expect to see new abstractions and building blocks emerging in the near future
- The authors have implemented Kineograph using C# with more than 17,000 lines of code, excluding test code.
- Table 1 summarizes the lines of code for different components in the Kineograph system and its applications.
- The authors evaluated Kineograph on a cluster with up to 51 machines, each connected with Gigabit Ethernet.
- 25 of the machines contained an Intel Xeon X3360 CPU and 8GB memory.
- The remaining had an Intel Xeon X5550 CPU and 12GB memory.
- All the machines ran the 64-bit version of Windows Server 2008R2 with .NET framework 4.0
- Kineograph reflects the belief that there is a potential paradigm shift in distributed-system research
- It departs from the “traditional” areas of high-throughput and scalable batch systems, as represented by systems such as GFS and MapReduce.
- The new paradigm is inspired by increasingly popular social networking, micro-blogging, and mobile Internet applications.
- These services are more centered around graph-based storage and computation, while striking a different and delicate balance among timeliness, consistency, and throughput.
- Table 1: Line-of-code count breakdown.
- Table 2: The impact of transient imbalance on throughput under K-Exposure, with 8 ingest nodes and 32 graph nodes. SCT Max/Avg: the ratio of maximum to average snapshot construction time.
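To make the user-ranking application concrete, here is a hedged sketch of a TunkRank-style iteration, the ranking the paper attributes to Tunkelang: Influence(X) is the sum, over each follower Y of X, of (1 + p * Influence(Y)) / |Following(Y)|. The toy graph, the damping factor p, and the iteration count below are illustrative assumptions, not values from the paper, and this runs on a static in-memory graph rather than on Kineograph snapshots.

```python
# Illustrative TunkRank-style sketch (not the Kineograph implementation).
# Influence(X) = sum over followers Y of X of (1 + p*Influence(Y)) / |Following(Y)|.

def tunkrank(followers, following_count, p=0.5, iters=50):
    """Fixed-point iteration over the follower graph.

    followers: node -> list of users who follow that node.
    following_count: node -> how many accounts that node follows (must be
    nonzero for any node that appears as a follower).
    """
    influence = {u: 1.0 for u in followers}  # arbitrary starting scores
    for _ in range(iters):
        influence = {
            x: sum((1 + p * influence[y]) / following_count[y] for y in fs)
            for x, fs in followers.items()
        }
    return influence

# Tiny example graph: b and c follow a; a follows b; nobody follows c.
followers = {"a": ["b", "c"], "b": ["a"], "c": []}
following_count = {"a": 1, "b": 1, "c": 1}
scores = tunkrank(followers, following_count)
# a is followed by two users, so it ends up with the highest influence.
```

With p = 0.5 this update is a contraction, so the scores converge to the unique fixed point regardless of the starting values.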
- Kineograph builds on a large body of existing literature in distributed systems and database systems. We focus on three most related areas: distributed in-memory storage (key/value) systems, incremental data processing, and graph computation.
- Distributed in-memory storage systems. Distributed in-memory key/value stores have received a lot of attention, both in the research community and in industry [13, 19, 21]. Kineograph leverages this technology, adds basic graph support, and, more importantly, supports snapshots.
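The snapshot support mentioned above is the key addition over a plain key/value store: graph updates are applied to a live, mutable state, while mining algorithms read only sealed, immutable epochs. The class and method names below are hypothetical illustrations of that epoch-based idea, not the Kineograph API.

```python
# Hypothetical sketch of epoch-based snapshots over an in-memory adjacency
# store: writers append to the live state, readers see only sealed epochs.

class SnapshotStore:
    def __init__(self):
        self.epoch = 0
        self.live = {}        # node -> mutable set of neighbors (in-flight updates)
        self.snapshots = {}   # sealed epoch number -> frozen adjacency map

    def add_edge(self, src, dst):
        # Graph updates are simple appends; no coordination with readers.
        self.live.setdefault(src, set()).add(dst)

    def seal_epoch(self):
        # Freeze the current state as a consistent snapshot for mining,
        # then advance the epoch counter. Returns the sealed epoch number.
        self.snapshots[self.epoch] = {u: frozenset(vs) for u, vs in self.live.items()}
        self.epoch += 1
        return self.epoch - 1

    def read(self, epoch, node):
        # Mining reads a sealed epoch, never in-flight updates.
        return self.snapshots[epoch].get(node, frozenset())

store = SnapshotStore()
store.add_edge("alice", "bob")
e0 = store.seal_epoch()
store.add_edge("alice", "carol")   # not visible at epoch e0
```

Because each sealed epoch copies the adjacency sets into frozensets, later updates cannot disturb a mining computation that is still reading an older snapshot.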
- Incremental data processing. Recently, many research efforts have focused on improving computation efficiency by augmenting existing scalable batch-processing engines with incremental computation capability. Systems like …
- R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing Surveys (CSUR), 40(1):1–39, 2008.
- P. Bhatotia, A. Wieder, R. Rodrigues, U. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. In ACM SoCC, 2011.
- Y. Bu, B. Howe, M. Balazinska, and M. Ernst. HaLoop: Efficient iterative data processing on large clusters. In VLDB, 2010.
- M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006.
- D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams – a new class of data management applications. In VLDB, 2002.
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
- J. Chen, D. J. Dewitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, 2000.
- T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd. edition, 2001.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1): 107–113, 2008.
- P. Gunda, L. Ravindranath, C. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic management of data and computation in datacenters. In OSDI, 2010.
- B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. In ACM SoCC, 2010.
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, 2010.
- R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: A high-performance, distributed main memory transaction processing system. In VLDB, 2008.
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In IEEE International Conference on Data Mining, 2009.
- L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
- D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In ACM SoCC, 2010.
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
- G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.
- memcached. Memcached: A distributed memory object caching system, 2011. http://memcached.org.
- Neo4j: The graph database, 2011. http://neo4j.org. Accessed October 2011.
- D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In SOSP, 2011.
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Stanford Technical Report, 1999.
- R. Pearce, M. Gokhale, and N. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010.
- D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010.
- L. Popa, M. Budiu, Y. Yu, and M. Isard. Dryadinc: Reusing work in large-scale computations. In HotCloud, 2009.
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI, 2010.
- D. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on twitter. In WWW, 2011.
- A. Sarma, S. Gollapudi, M. Najork, and R. Panigrahy. A sketch-based distance oracle for web-scale graphs. In WSDM, 2010.
- Tweets about Steve Jobs spike but don't break Twitter peak record, 2011.
- A. Tanenbaum. Distributed Operating Systems. Prentice Hall, 1995.
- D. Tunkelang. A Twitter analog to PageRank. Retrieved from http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank, 2009.