Efficiently Enumerating Substrings With Statistically Significant Frequencies Of Locally Optimal Occurrences In Gigantic String

THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE(2020)

引用 23|浏览49
暂无评分
摘要
We propose new frequent substring pattern mining which can enumerate all substrings with statistically significant frequencies of their locally optimal occurrences from a given single sequence. Our target application is genome sequences, around a half being said to be covered by interspersed and consecutive (tandem) repeats, and detecting these repeats is an important task in molecular life sciences. We evaluate the statistical significance of frequent substrings by using a string generation model with a memoryless stationary information source. We combine this idea with an existing algorithm, ESFLOO-0G.C (Nakamura et al. 2016), to enumerate all statistically significant substrings with locally optimal occurrences. We further develop a parallelized version of our algorithm. Experimental results using synthetic datasets showed the proposed algorithm achieved far higher F-measure in extracting substrings (with various lengths and frequencies) embedded in a randomly generated string with noise, than conventional algorithms. The large-scale experiment using the whole human genome sequence with 3,095,677,412 bases (letters) showed that our parallel algorithm covers 75% of the whole positions analyzed. around 4% and 24% higher than the recent report and the current cutting-edge knowledge, implying a biologically unique finding.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要