kmcEx: Memory-Frugal and Retrieval-Efficient Encoding of Counted K-Mers

Peng Jiang, Jie Luo, Yiqi Wang, Pingji Deng, Bertil Schmidt, Xiangjun Tang, Ningjiang Chen, Limsoon Wong, Liang Zhao

Bioinformatics (2019)

Abstract

MOTIVATION: K-mers, together with their frequencies, serve as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., and have attracted intensive study of k-mer counting. However, the output of a k-mer counter is itself large; very often it is too large to fit into main memory, which severely limits its usability.

RESULTS: We introduce a novel idea of encoding k-mers together with their frequencies, achieving good memory savings and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in coupled bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that, with 7 hash functions, the average memory-saving ratio on all 31-mers is as high as 13.81 compared with the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude.

AVAILABILITY AND IMPLEMENTATION: The source code of our algorithm is available at github.com/lzhLab/kmcEx.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
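The abstract does not spell out the encoding itself, so the following is only a minimal Python sketch of the general idea of a Bloom filter-like structure with two coupled arrays, one marking k-mer presence and one holding frequencies. It is not the kmcEx algorithm: the class name CountedKmerSketch, the keep-the-maximum update rule, the take-the-minimum query rule, and the BLAKE2-based double hashing are all assumptions chosen for illustration. Only the default of 7 hash functions comes from the abstract.

```python
import hashlib


class CountedKmerSketch:
    """Hypothetical illustration only; NOT the encoding used by kmcEx."""

    def __init__(self, num_slots, num_hashes=7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # Presence array: one byte per slot for readability
        # (a real implementation would pack these into bits).
        self.present = bytearray(num_slots)
        # Coupled frequency array, indexed by the same slot positions.
        self.counts = [0] * num_slots

    def _positions(self, kmer):
        # Assumption: derive all slot indices from one 128-bit digest
        # by double hashing, position_i = (h1 + i * h2) mod num_slots.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, kmer, count):
        for p in self._positions(kmer):
            self.present[p] = 1
            # Assumed collision rule: keep the largest count seen in a slot.
            if count > self.counts[p]:
                self.counts[p] = count

    def query(self, kmer):
        positions = self._positions(kmer)
        # Any unset presence slot proves the k-mer was never added.
        if not all(self.present[p] for p in positions):
            return 0
        # Otherwise report the smallest coupled count. For an inserted
        # k-mer this is never below its true count, but collisions can
        # inflate it, and an absent k-mer can yield a false positive.
        return min(self.counts[p] for p in positions)


# Example usage with made-up counts.
sketch = CountedKmerSketch(num_slots=4096)
sketch.add("ACGTACGTACGTACGTACGTACGTACGTACG", 12)
sketch.add("TTTTACGTACGTACGTACGTACGTACGTACG", 3)
estimate = sketch.query("ACGTACGTACGTACGTACGTACGTACGTACG")
```

Under these assumptions a lookup touches a fixed number of slots regardless of how many k-mers are stored, which is the sense in which retrieval in such a structure is effectively constant time. The price is approximation: a reported frequency is an upper bound on the stored one, and how often it overshoots depends on the number of slots relative to the number of k-mers.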
