Fast discovery of similar sequences in large genomic collections

ADVANCES IN INFORMATION RETRIEVAL (2006)

Abstract
Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible — either in space or in time — as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and executes in a time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.
Keywords
sequence tag data, genome comparison, genomic collection, large genomic collection, substantial speed improvement, similar sequence, successful existing approach, clustering sequence collection, large protein sequence collection, fast discovery, CD-HIT algorithm, improved search speed, sequences, clustering, accuracy, protein sequence, expressed sequence tag
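
The abstract names document fingerprinting as the basis of the approach but gives no detail. As a rough illustration of the general idea only, and not the authors' algorithm, the sketch below hashes every k-mer of each sequence, keeps a subset of the hashes as that sequence's fingerprint, and reports pairs of sequences that share enough fingerprint hashes. The function names, the "hash mod p" selection rule, and the parameters `k`, `modulus`, and `min_shared` are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def fingerprints(seq, k=8, modulus=4):
    """Return the set of selected k-mer hashes for one sequence.

    Assumption: a k-mer is kept when its hash is divisible by `modulus`,
    so on average about 1/modulus of the k-mers are retained.
    """
    selected = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % modulus == 0:
            selected.add(h)
    return selected


def similar_pairs(sequences, k=8, modulus=4, min_shared=2):
    """Map each pair of sequence names to its count of shared hashes.

    `sequences` is a dict of name -> sequence string. Only pairs sharing
    at least `min_shared` fingerprint hashes are returned.
    """
    # Inverted index: fingerprint hash -> names of sequences containing it.
    index = defaultdict(list)
    for name, seq in sequences.items():
        for h in fingerprints(seq, k, modulus):
            index[h].append(name)

    # Count, for every pair of names, how many hashes they have in common.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

A call such as `similar_pairs({"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "TTTTTTTTTT"}, k=4, modulus=1)` keeps every k-mer and reports only the pair `("a", "b")`. In this toy version a hash that occurs in many sequences produces a quadratic number of pair updates, so a practical system would need some way to bound or skip very common k-mers.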