Chrome Extension
WeChat Mini Program
Use on ChatGLM

Automatized Bioinformatics Data Integration in a Hadoop-based Data Lake

Artificial Intelligence and Applications(2022)

Cited 0|Views6
No score
Abstract
When we work in a data lake, data integration is not easy, mainly because the data is usually stored in raw format. Manually performing data integration is a time-consuming task that requires the supervision of a specialist, which can make mistakes or not be able to see the optimal point for data integration among two or more datasets. This paper presents a model to perform heterogeneous in-memory data integration in a Hadoop-based data lake based on a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration points. The algorithm for data integration is based on the Overlap coefficient since it presented better results when compared with the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We tested our model applying it on eight bioinformatics-domain datasets. Our model presents better results when compared to an analysis of a specialist, and we expect our model can be reused for other domains of datasets.
More
Translated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined