HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis Using Hadoop.

IEEE International Conference on e-Science (2014)

Abstract

Hadoop is a convenient framework in e-Science enabling scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short-read quality assessment as well as functionality to count genes mapped by the short reads. We use the Hadoop-streaming library, which allows the components to run both in Hadoop and on regular Linux systems, and evaluate their performance in two execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare the implementations with Apache Pig, showing improved runtime performance of our developed methods. We also integrate the components into the graphical platform Cloudgene to simplify user interaction.

Keywords

hadoop, massively parallel sequencing, map-reduce, bioinformatics
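
The abstract notes that Hadoop-streaming components can run both in Hadoop and on a regular Linux system, because each step only reads lines from standard input and writes lines to standard output. The following is a minimal sketch of that pattern for gene counting, not the authors' implementation. It assumes a hypothetical tab-separated input of `read_id<TAB>gene_id` (one line per already-assigned read) and does not call HTSeq itself; the function names `mapper` and `reducer` are illustrative.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'gene_id<TAB>1' for every read that is assigned to a gene.

    Assumed (hypothetical) input format: 'read_id<TAB>gene_id' per line.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed lines and reads without a gene
        yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum the counts per gene. Input must already be sorted by gene_id,
    which Hadoop's shuffle phase (or a local `sort`) provides."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{gene}\t{total}"


# Run as a streaming step only when called with 'map' or 'reduce'.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as `count_genes.py`, the same script can be checked on a single Linux machine with `python count_genes.py map < assignments.tsv | sort | python count_genes.py reduce`, where `sort` stands in for Hadoop's shuffle phase. Under Hadoop streaming the two modes would be supplied as the mapper and reducer commands.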
