Transformation Of Fasta Files Into Feature Vectors For Unsupervised Compression Of Short Reads Databases

JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY（2021）

引用 5|浏览15

暂无评分

摘要

FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, current compression of these data sets is carried out one-by-one without consideration of the inter-similarity between the data sets which can be otherwise exploited to enhance compression performance of de novo compression. We show that clustering these data sets into similar sub-groups for a group-by-group compression can greatly improve the compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) for every read in each data set, and uses these k-mers as features and their frequencies in every data set as feature values to transform these huge data sets each into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. As the amount of common k-mers of similar feature values between two data sets implies an excessive proportion of overlapping reads shared between the two data sets, merging similar data sets creates immense sequence redundancy to boost the compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).

查看译文

关键词

Sequence reads, k-minimizer, compression, clustering

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要