LISA: Learned Indexes for DNA Sequence Analysis

(2020)

Abstract
Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences.

Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search: exact search and super-maximal exact match (SMEM) search. LISA builds on and extends the FM-index, the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves speedups of up to 2.2x and 13.3x over state-of-the-art FM-index-based implementations for exact search and SMEM search, respectively.
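For readers unfamiliar with the baseline, the sketch below illustrates the textbook FM-index backward search that exact search builds on. It is not LISA's learned index and not the authors' implementation: the function names `build_fm_index` and `exact_search` are hypothetical, and the index is built naively (full suffix sorting and an uncompressed occurrence table), which is only practical for toy inputs.

```python
def build_fm_index(text):
    """Build a naive FM-index: suffix array, C table, and Occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    # Suffix array via direct suffix sorting (fine for short strings only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def exact_search(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern, using backward search."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval is empty: no match
            return []
    return sorted(sa[lo:hi])


# Toy reference sequence, chosen for illustration only.
sa, c_table, occ = build_fm_index("ACGTACGTTAGC")
hits = exact_search("ACG", sa, c_table, occ)  # positions 0 and 4
```

Each pattern character costs one interval update, so a query of length m takes m dependent steps through the occurrence table; this per-character traversal is the kind of search cost that the abstract says LISA aims to accelerate.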
Keywords
Genomics, DNA sequencing, Sequence analysis, Genome, Computational biology, DNA, Data analysis, Throughput, Computer science, Exact match