Gene Sketching

Jordan Boyd-Graber, Adrian de Froment, Alex Golovinskiy, Jesse Levinson

msra(2005)

Abstract
The central task of modern bioinformatics is the analysis of data to find patterns that represent relevant biological information. The success of this field, however, may one day lead to serious problems as the volume of data overwhelms the methods and resources traditionally used. A world in which every individual's genome is mapped and stored in a central medical database might produce more information than even future computers following the trajectory of Moore's law could handle. Other fields have faced similar information overload and have developed interesting solutions. One such problem is searching for similar images in a large repository of pictures. Images, which are essentially just arrays of color intensities, are inefficient representations of their contents. If we are interested only in the differences between two images, the only information we need to store is the "distance" between them, however we define it. Using a compact representation that attempts to retain only the distance between two images, (Lv) created an image similarity search system with over a hundredfold decrease in storage size and a commensurate decrease in execution time, without significantly hampering the accuracy of the comparisons. This paper outlines the creation and evaluation of a novel compression scheme for biological information, motivated by the success of such compact representations. After outlining the theory behind our approach and the tools and techniques used to implement it, we present an analysis of the effectiveness and accuracy of the methodology for typical biological tasks such as finding similar subsets of genes and predicting gene ontology. We then present some possible extensions.
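The abstract does not say how the compact representation is built. As a rough illustration of the general idea (storing only enough information to recover a distance), the sketch below uses sign random projection, a standard technique swapped in here as an assumption and not necessarily the construction used by the authors or by (Lv). Each vector is reduced to a bit string, and the fraction of differing bits between two bit strings estimates the angle between the original vectors. All function names and the sample profiles are hypothetical.

```python
import math
import random


def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions of length dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sketch(vector, projections):
    """Pack the sign of each random projection into one integer, one bit per direction."""
    bits = 0
    for i, direction in enumerate(projections):
        dot = sum(v * d for v, d in zip(vector, direction))
        if dot >= 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Count the bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


def estimated_angle(a, b, n_bits):
    """The fraction of differing bits, scaled by pi, estimates the angle in radians."""
    return math.pi * hamming(a, b) / n_bits


def true_angle(u, v):
    """Exact angle between two vectors, for comparison with the estimate."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


if __name__ == "__main__":
    # Two made-up expression profiles (illustrative values only).
    gene_a = [2.1, 0.3, 1.7, 0.0, 4.2, 1.1]
    gene_b = [1.9, 0.5, 1.2, 0.4, 3.8, 0.9]
    n_bits = 1024
    projections = make_projections(len(gene_a), n_bits)
    sa, sb = sketch(gene_a, projections), sketch(gene_b, projections)
    print("estimated angle:", estimated_angle(sa, sb, n_bits))
    print("true angle:     ", true_angle(gene_a, gene_b))
```

The trade-off in this illustration is controlled by `n_bits`: fewer bits give a smaller sketch but a noisier distance estimate, which mirrors the storage-versus-accuracy balance the abstract describes.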