Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings.

Journal of Parallel and Distributed Computing (2020)

Abstract
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest. Formally, let D be a collection of n sequences of total length N, ϕ be a length threshold, and k be a mismatch threshold. The goal is to identify and report all k-mismatch maximal common substrings of length at least ϕ over all pairs of strings in D. Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods cannot provide any provably efficient run time guarantees. To this end, we present a sequential algorithm with an expected run time of $O(N \log^k N + occ)$, where occ is the output size. We then present a distributed memory parallel algorithm with an expected run time of $O\left(\left(\frac{N}{p} \log N + occ\right) \log^k N\right)$ using $O(\log^{k+1} N)$ expected rounds of global communications, under some realistic assumptions, where p is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high throughput sequencing data.
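To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm presented in the paper and offers none of its run time guarantees; it simply scans every alignment diagonal of every pair of sequences, which takes time quadratic in the sequence lengths. It assumes that "maximal" means the aligned pair of substrings cannot be extended by one character on either side without exceeding k mismatches. The function names `kmismatch_mcs` and `all_pairs` and the output format are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kmismatch_mcs(x, y, k, phi):
    """Brute-force illustration (assumed definition, not the paper's method).

    Returns sorted (i, j, length) triples such that x[i:i+length] and
    y[j:j+length] differ in at most k positions, cannot be extended on
    either side without exceeding k mismatches, and length >= phi.
    """
    out = []
    # Each diagonal d = j - i is one gap-free alignment of x against y.
    for d in range(-(len(x) - 1), len(y)):
        i0, j0 = (-d, 0) if d < 0 else (0, d)
        diag_len = min(len(x) - i0, len(y) - j0)
        # Mismatch offsets on this diagonal, with a sentinel at each end.
        s = [-1]
        s += [t for t in range(diag_len) if x[i0 + t] != y[j0 + t]]
        s += [diag_len]
        if len(s) < k + 2:
            # Fewer than k mismatches in total: the whole diagonal is maximal.
            windows = [(0, diag_len - 1)]
        else:
            # A maximal window lies strictly between mismatch a and
            # mismatch a + k + 1, so it contains exactly k mismatches.
            windows = [(s[a] + 1, s[a + k + 1] - 1)
                       for a in range(len(s) - k - 1)]
        for lo, hi in windows:
            if hi - lo + 1 >= phi:
                out.append((i0 + lo, j0 + lo, hi - lo + 1))
    return sorted(out)


def all_pairs(seqs, k, phi):
    """Apply kmismatch_mcs to every unordered pair of sequences in seqs."""
    results = []
    for (a, x), (b, y) in combinations(enumerate(seqs), 2):
        for i, j, length in kmismatch_mcs(x, y, k, phi):
            results.append((a, b, i, j, length))
    return results


if __name__ == "__main__":
    print(all_pairs(["ACGTACGT", "ACGAACGT", "TTTT"], k=1, phi=5))
```

Here `seqs` plays the role of D, and each reported tuple `(a, b, i, j, length)` identifies the two sequences, the start positions in each, and the common length. The quadratic cost per pair is the kind of behavior the sequential and parallel algorithms in the paper are designed to avoid.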
Keywords
Approximate sequence matching, String algorithms, Suffix trees, Hamming distance, Parallel algorithms