Protein database search using compressed k-mer vocabularies.

Lawrence Buckingham, Shlomo Geva, James M. Hogan

ADCS '18: Proceedings of the 23rd Australasian Document Computing Symposium (2018)

Abstract

Efficient and accurate search in biological sequence databases remains a priority because of the ongoing rapid accumulation of genomic data available for analysis. An array of accelerated sequence comparison methods has been implemented, including tools which compute explicit pairwise alignments, and alignment-free techniques based on word co-occurrence, locality-sensitive hashing, or metric embedding. These methods offer significant speed improvement over standard algorithms, but increased throughput comes at the cost of reduced sensitivity. Strategies such as inverted indexing and hashing enable efficient retrieval of stored sequences which share near-identical sub-sequences with a query, but their precision diminishes as the level of shared identity decreases, so that sequences which are distantly related to the query go undetected. We present a new sequence database search algorithm which derives a compressed vocabulary consisting of sub-sequences of length k (k-mers) sampled from the database, and uses the compressed vocabulary to map each sequence to a binary feature vector based on its content. Feature vector similarity is taken as a proxy for more expensive local alignment measurements. Feature vector construction seamlessly incorporates biologically grounded symbol substitutions, so the algorithm remains effective at low levels of sequence identity. Empirical tests conducted with real-world data demonstrate that the binary vector encoding permits ranking accuracy that rivals and in some cases exceeds that of mainstream database search programs, with run times that are faster by an order of magnitude or more.
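The pipeline described in the abstract (sample a k-mer vocabulary, encode each sequence as a binary vector, rank by vector similarity) can be illustrated with a minimal sketch. The abstract does not specify how the vocabulary is compressed, how substitutions are scored, or which similarity measure is used, so everything below is an assumption: the function names are hypothetical, the vocabulary is a plain random sample, a mismatch-count threshold stands in for biologically grounded substitution scoring, and Jaccard similarity stands in for the paper's vector comparison.

```python
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_vocabulary(database, k, size, seed=0):
    """Sample up to `size` distinct k-mers from the database.

    Assumption: uniform random sampling; the paper's actual
    compression scheme is not given in the abstract.
    """
    pool = sorted(set().union(*(kmers(s, k) for s in database)))
    rng = random.Random(seed)
    return rng.sample(pool, min(size, len(pool)))


def mismatches(a, b):
    """Count positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))


def encode(seq, vocabulary, k, max_mismatch=1):
    """Binary feature vector, stored as the set of indices that are 1.

    Bit j is set when the sequence contains a k-mer within
    `max_mismatch` mismatches of vocabulary term j (a stand-in
    for substitution-aware matching).
    """
    words = kmers(seq, k)
    return frozenset(
        j for j, term in enumerate(vocabulary)
        if any(mismatches(term, w) <= max_mismatch for w in words)
    )


def jaccard(a, b):
    """Jaccard similarity of two index sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query, database, vocabulary, k, top=3, max_mismatch=1):
    """Rank database sequences by feature-vector similarity to the query."""
    q = encode(query, vocabulary, k, max_mismatch)
    scored = [
        (jaccard(q, encode(s, vocabulary, k, max_mismatch)), i)
        for i, s in enumerate(database)
    ]
    # Highest similarity first; ties broken by database index.
    return sorted(scored, key=lambda t: (-t[0], t[1]))[:top]


database = ["MKVLAAGIVG", "MKVLSAGIVG", "WWPPHHCCDD"]
vocabulary = build_vocabulary(database, k=3, size=100)
ranking = search("MKVLAAGIVG", database, vocabulary, k=3)
```

In this toy database the second sequence differs from the query by a single residue. With exact matching (`max_mismatch=0`) the two vectors only partly overlap, while tolerating one mismatch per k-mer makes them coincide, which hints at why substitution-aware encoding helps at lower sequence identity.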

Keywords

Sequence comparison, Protein database search
