Rapid hierarchical clustering of biological sequences.

ADCS '18: Proceedings of the 23rd Australasian Document Computing Symposium (2018)

Abstract
Genomic sequences can be viewed as special types of documents. These are typically organised and stored as collections of text documents. The collections are vast; a single metagenomics sequencing experiment can produce as much text data as the entire contents of Wikipedia. The increasing scale of the datasets obtained from environmental and clinical metagenomic studies has posed great difficulties to researchers attempting to organise this data with traditional clustering techniques. In this work we introduce a clustering approach that combines binary signature representations of sequence data, which allow extremely fast pairwise comparison through hardware-level operations, with an O(N log N) hierarchical clustering approach based on the k-tree data structure. The k-tree approach has previously been applied to the clustering of English-language documents from the ClueWeb collection. Here we extend its use to the clustering of genomic sequence text files. We demonstrate the success of our approach on the largest real biological datasets available from the SILVA project, showing that our methods provide clustering quality comparable to that of a number of standard methods while running about an order of magnitude faster than these alternatives. Through the use of synthetic data, we show that this new approach is able to handle even extreme-scale datasets in convenient timeframes, allowing rapid analyses to be performed over collections of tens of millions of reads (sequences), as are typically produced in a single sequencing run.
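To illustrate why binary signatures make pairwise comparison cheap, the following is a minimal sketch, not the authors' implementation. It assumes a toy signature scheme in which every k-mer of a sequence is hashed and votes on each bit of a fixed-width signature; the names `kmer_signature` and `hamming`, the 64-bit width, and k = 5 are all hypothetical choices made for illustration. Once two sequences are reduced to signatures, their distance is a single XOR followed by a bit count, which is the kind of operation that maps onto hardware instructions.

```python
import hashlib

SIG_BITS = 64  # assumed signature width, chosen only for illustration


def kmer_signature(seq: str, k: int = 5, bits: int = SIG_BITS) -> int:
    """Toy binary signature (an assumption, not the paper's scheme).

    Each k-mer is hashed; every bit of the hash casts a +1 or -1 vote
    for the matching signature position. A position is set to 1 when
    its vote total is positive.
    """
    votes = [0] * bits
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha256(seq[i:i + k].encode("ascii")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if votes[b] > 0:
            sig |= 1 << b
    return sig


def hamming(a: int, b: int) -> int:
    """Hamming distance between two signatures: XOR, then count set bits."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    s1 = kmer_signature("ACGTACGTTAGCCGATAGGCTAACGT")
    s2 = kmer_signature("ACGTACGTTAGCCGATAGGCTTACGT")
    print(hamming(s1, s2))
```

Because each comparison touches only a fixed number of bits regardless of sequence length, the cost of a pairwise comparison stays constant, which is what lets a tree-based clustering scheme perform many comparisons quickly.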
Keywords
Clustering, document signatures, read sequencing, bioinformatics, Hamming distance