A Map-Reduce Framework for Clustering Metagenomes

Parallel and Distributed Processing Symposium Workshops & PhD Forum (2013)

Abstract
The past few years have seen an explosion in the use of sequencing technologies for metagenomics, i.e., the determination of the collective genome of microorganisms co-existing within several environments. In parallel, there has been rapid development of computational tools for quantifying the abundance, diversity and functionality of different species within these communities. Several clustering algorithms (also called binning algorithms) have been developed to categorize similar metagenome sequence reads for efficient post-processing and analysis. In this paper we present a distributed algorithm for clustering metagenome sequence reads. The algorithm is implemented on the Map-Reduce based Hadoop platform and approximates the computation of pairwise sequence similarity with a minwise hashing approach. The algorithm, referred to as MrMC-MinH, can perform either agglomerative hierarchical clustering or greedy clustering. The key advantage of MrMC-MinH is its ability to handle large volumes of sequence reads obtained from targeted 16S metagenomic or whole metagenomic data. We evaluate the performance of our algorithm on several real and simulated metagenome benchmarks and demonstrate that our approach is computationally efficient and produces accurate clustering results when evaluated against external ground truth. The source code for MrMC-MinH will be made available at the supplementary website.
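To make the minwise hashing idea concrete, the following is a minimal single-process sketch, not the MrMC-MinH implementation itself (which runs on Hadoop and is not reproduced here). It assumes reads are compared through their sets of overlapping k-mers, that the fraction of matching min-hash values estimates the Jaccard similarity of those sets, and that greedy clustering assigns each read to the first cluster representative whose estimated similarity meets a threshold. All function names, the k-mer length, the number of hash functions, and the threshold are illustrative assumptions.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus for the hash family


def kmers(read, k=5):
    """Return the set of overlapping k-mers in a read."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def make_hash_params(num_hashes=100, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(read, params, k=5):
    """Keep the minimum hash value of the read's k-mers for each hash function."""
    # crc32 gives a deterministic integer per k-mer (unlike built-in hash()).
    base = [zlib.crc32(km.encode()) for km in kmers(read, k)]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; estimates Jaccard similarity of the k-mer sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def greedy_cluster(reads, threshold=0.5, k=5, num_hashes=100):
    """Assign each read to the first representative that is similar enough.

    Assumption: the first read of each cluster serves as its representative.
    """
    params = make_hash_params(num_hashes)
    representatives = []  # one signature per cluster
    labels = []
    for read in reads:
        sig = minhash_signature(read, params, k)
        for cluster_id, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No representative was close enough: start a new cluster.
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

Calling `greedy_cluster(list_of_reads)` returns one integer cluster label per read. Because each signature has a fixed length regardless of read length, comparing two reads costs time proportional to the number of hash functions rather than requiring a full pairwise alignment, which is the motivation for the approximation.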
Keywords
genome sequence,accurate clustering result,sequence technology,map-reduce framework,greedy clustering approach,clustering metagenomes,whole metagenomic data,agglomerative hierarchical clustering,clustering algorithm,pair wise sequence similarity,collective genome,efficient post-processing,parallel algorithms,microorganisms,approximation algorithms,distributed algorithm,greedy algorithms,algorithm design and analysis,sequential analysis,bioinformatics,genomics,clustering algorithms