Gobblin: Unifying Data Ingestion for Hadoop.
PVLDB (2015)
Abstract
Data ingestion is an essential part of companies and organizations that collect and analyze large volumes of data. This paper describes Gobblin, a generic data ingestion framework for Hadoop and one of LinkedIn's latest open source products. At LinkedIn we need to ingest data from various sources such as relational stores, NoSQL stores, streaming systems, REST endpoints, filesystems, etc. into our Hadoop clusters. Maintaining independent pipelines for each source can lead to various operational problems. Gobblin aims to solve this issue by providing a centralized data ingestion framework that makes it easy to support ingesting data from a variety of sources. Gobblin distinguishes itself from similar frameworks by focusing on three core principles: generality, extensibility, and operability. Gobblin supports a mixture of data sources out-of-the-box and can be easily extended for more. This enables an organization to use a single framework to handle different data ingestion needs, making it easy and inexpensive to operate. Moreover, with an end-to-end metrics collection and reporting module, Gobblin makes it simple and efficient to identify issues in production.