
Meta-colored Compacted De Bruijn Graphs.

bioRxiv: the preprint server for biology (2023)

Abstract
The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set (the color of the k-mer) efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing. We describe the meta-colored compacted de Bruijn graph (Mac-dBG), a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads.

Code availability. A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor
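To make the "map from k-mers to colors" view and the idea of factorizing sub-patterns across colors concrete, here is a minimal Python sketch. It is not the Mac-dBG implementation: the function names (`build_color_map`, `factorize_colors`, `expand`) are hypothetical, reverse complements and graph compaction are ignored, and the grouping of references into partitions is simply passed in as an input. The abstract does not say how the factorization is chosen, only that doing so optimally is NP-hard.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_color_map(references, k):
    """Map each k-mer to its color: the sorted tuple of ids of the
    references in which it appears."""
    occurrences = defaultdict(set)
    for ref_id, seq in enumerate(references):
        for kmer in kmers(seq, k):
            occurrences[kmer].add(ref_id)
    return {kmer: tuple(sorted(ids)) for kmer, ids in occurrences.items()}


def factorize_colors(colors, partitions):
    """Split every color along a given grouping of references and store
    each distinct sub-list only once.

    `partitions` is a list of lists of reference ids (an assumed input
    for this illustration). Returns (partial_colors, meta_colors), where
    partial_colors[p] lists the distinct sub-lists seen in partition p and
    meta_colors maps a color to a list of (partition, sub-list index)."""
    partial_colors = [[] for _ in partitions]
    index = [dict() for _ in partitions]
    meta_colors = {}
    for color in colors:
        meta = []
        for p, group in enumerate(partitions):
            members = set(group)
            part = tuple(r for r in color if r in members)
            if not part:
                continue  # this color has no reference in partition p
            if part not in index[p]:
                index[p][part] = len(partial_colors[p])
                partial_colors[p].append(part)
            meta.append((p, index[p][part]))
        meta_colors[color] = meta
    return partial_colors, meta_colors


def expand(meta, partial_colors):
    """Recover the original color from its factorized form."""
    out = []
    for p, j in meta:
        out.extend(partial_colors[p][j])
    return tuple(sorted(out))
```

A color shared by many k-mers is stored once, and sub-lists that recur across different colors are stored once per partition; each color then only keeps short references to them. How much space this saves depends heavily on the chosen grouping, which is where a heuristic would come in.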
Keywords
Planar Graph Embedding