Spectrum preserving tilings enable sparse and modular reference indexing

bioRxiv (2023)

Abstract
The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences ℛ so that the positions of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of ℛ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in ℛ. By encoding the order and positions in which tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recent work on constructing and compactly indexing k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique k-mers in ℛ. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset.

Supplementary materials: Sections S.1 to S.8 are available online.

Availability: pufferfish2 is implemented in Rust and is available online.

Competing Interest Statement: R.P. is a co-founder of Ocean Genomics Inc.
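The two-step decomposition described in the abstract (k-mer to tile, then tile to occurrences) can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the pufferfish2 implementation: the tiles, the reference names, the value of `K`, and the function names `spell`, `build_index`, and `locate` are all hypothetical; adjacent tiles are assumed to overlap by `K - 1` characters; and reverse complements and tile orientation are ignored.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption for this example)

# Hypothetical tiles. Every k-mer appears in exactly one tile.
tiles = {0: "ACGT", 1: "GTTA", 2: "GTCA"}

# Each reference is spelled out by an ordered list of tile ids.
# Adjacent tiles are assumed to overlap by K - 1 characters.
tilings = {"ref1": [0, 1], "ref2": [0, 2]}


def spell(tile_ids):
    """Reconstruct a reference sequence from its ordered tiles."""
    seq = tiles[tile_ids[0]]
    for tid in tile_ids[1:]:
        seq += tiles[tid][K - 1:]  # skip the overlapping prefix
    return seq


def build_index():
    """Build the two mappings that make up the modular index."""
    # (1) k-mer-to-tile: size grows with the number of unique k-mers.
    kmer_to_tile = {}
    for tid, seq in tiles.items():
        for off in range(len(seq) - K + 1):
            kmer_to_tile[seq[off:off + K]] = (tid, off)

    # (2) tile-to-occurrence: size grows with total reference length,
    # because a shared tile is recorded once per occurrence.
    tile_to_occ = defaultdict(list)
    for ref, tile_ids in tilings.items():
        pos = 0
        for tid in tile_ids:
            tile_to_occ[tid].append((ref, pos))
            pos += len(tiles[tid]) - (K - 1)
    return kmer_to_tile, tile_to_occ


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Return every (reference, position) at which the k-mer occurs."""
    if kmer not in kmer_to_tile:
        return []
    tid, off = kmer_to_tile[kmer]
    return [(ref, pos + off) for ref, pos in tile_to_occ[tid]]


kmer_to_tile, tile_to_occ = build_index()
hits = locate("CGT", kmer_to_tile, tile_to_occ)
```

In this toy input, tile 0 is stored once in the k-mer-to-tile mapping but contributes two entries to the tile-to-occurrence mapping, one per reference. This mirrors the abstract's point that the second mapping scales with total sequence rather than with the number of unique k-mers, which is why it is the target of the sampling schemes.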