Gazelle: transcript abundance query against large-scale RNA-seq experiments

BCB (2021)

Abstract
High-throughput sequencing data is growing exponentially in almost every sequencing data repository. To date, most exploratory analysis of these large datasets requires heavyweight data processing pipelines that are both resource- and labor-intensive. Very recently, various algorithms have been developed to enable arbitrary sequence queries over large collections of sequencing data. These algorithms were designed to support presence/absence queries, i.e., screening for RNA-seq samples that contain a given transcript sequence. Their utility is limited because they cannot retrieve the abundance of the query sequence. Such abundance information is critical in real applications for understanding how variation in transcript expression is associated with different biological conditions or disease subtypes. In this paper, we present Gazelle, a sequence query engine that enables fast, quantified queries against large-scale RNA-seq experiments. Gazelle exploits the advantages of two different types of hashing algorithms and combines them into one integrated structure to support highly efficient and accurate sequence queries with abundance. We evaluated Gazelle on three datasets to benchmark its efficiency, its accuracy, and its utility in real-life applications. Our results show that Gazelle achieves near-perfect k-mer queries, supports on-demand sequence queries against moderately large sequence databases, and yields abundance estimates that are highly consistent with RT-qPCR as well as with traditional transcript quantification methods such as Kallisto.
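To make the distinction between a presence/absence query and a query with abundance concrete, here is a minimal Python sketch. It is not Gazelle's data structure: the abstract does not specify the two hashing algorithms, so this toy version simply uses an exact in-memory k-mer counter per sample. The function names, the toy reads, and the choice of the median k-mer count as the abundance summary are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each sample name to a Counter of k-mer occurrences over its reads.

    Assumption: an exact Counter stands in for whatever compact hashed
    structure a real engine would use.
    """
    return {
        name: Counter(km for read in reads for km in kmers(read, k))
        for name, reads in samples.items()
    }


def query_presence(index, query, k):
    """Return the samples in which every k-mer of the query occurs."""
    query_kmers = list(kmers(query, k))
    return {
        name
        for name, counts in index.items()
        if all(counts[km] > 0 for km in query_kmers)
    }


def query_abundance(index, query, k):
    """Return, per sample, the median count of the query's k-mers.

    Assumption: the median is one simple summary; the query must be at
    least k characters long, otherwise there are no k-mers to summarize.
    """
    query_kmers = list(kmers(query, k))
    return {
        name: median(counts[km] for km in query_kmers)
        for name, counts in index.items()
    }


# Toy data (hypothetical): three "samples" with a handful of reads each.
samples = {
    "sample_a": ["ACGTAC", "ACGTAC", "ACGTAC"],
    "sample_b": ["ACGTAC"],
    "sample_c": ["TTTTTT"],
}
index = build_index(samples, k=4)

print(query_presence(index, "ACGTAC", 4))
print(query_abundance(index, "ACGTAC", 4))
```

In this toy example the presence query cannot tell `sample_a` from `sample_b`, while the abundance query reports a higher count for `sample_a`, which is the kind of information needed to compare expression across conditions.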
Keywords
transcript query, RNA-seq, indexing