DandD: Efficient measurement of sequence growth and similarity

iScience (2024)

Abstract
Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool addressing how much new sequence is gained when a sequence collection grows. DandD can describe how much structural variation is discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ ("delta"), developed initially for data compression and chiefly dependent on k-mer counts. DandD rapidly estimates δ using genomic sketches. We propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, thereby avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD's functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard.
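
To make the role of k-mer counts concrete, the sketch below computes δ exactly for tiny inputs, assuming the standard data-compression definition δ = max over k of (number of distinct k-mers) / k. It is an illustration only, not DandD's implementation: DandD estimates δ from genomic sketches rather than by enumeration, and this toy version ignores reverse complements. The function names (`distinct_kmers`, `delta`, `delta_jaccard`) are hypothetical, and the Jaccard form shown, with δ substituted for set cardinalities in the inclusion-exclusion formula, is an assumption based on the abstract's description.

```python
def distinct_kmers(seqs, k):
    """Return the set of distinct length-k substrings across all sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def delta(seqs):
    """Exact delta: max over k of (number of distinct k-mers) / k.

    Assumes the standard definition of delta; brute force, so only
    suitable for very short toy sequences.
    """
    longest = max((len(s) for s in seqs), default=0)
    return max(
        (len(distinct_kmers(seqs, k)) / k for k in range(1, longest + 1)),
        default=0.0,
    )


def delta_jaccard(a, b):
    """Jaccard-style similarity with delta in place of k-mer cardinalities.

    Assumed form: (delta(A) + delta(B) - delta(A u B)) / delta(A u B),
    where A u B is the combined collection of sequences.
    """
    d_a, d_b, d_union = delta(a), delta(b), delta(a + b)
    if d_union == 0:
        return 0.0
    return (d_a + d_b - d_union) / d_union


if __name__ == "__main__":
    print(delta(["ACGT"]))                      # maximized at k = 1 here
    print(delta_jaccard(["ACGT"], ["ACGT"]))    # identical collections
    print(delta_jaccard(["AAAA"], ["CCCC"]))    # no shared content
```

Because δ takes the maximum over all k, no single k has to be chosen up front, which is the property the abstract refers to as k-independence.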
Keywords
Genomics, Biocomputational method, Genomic analysis, Sequence analysis