A Distance Measure for Phylogenetic Analysis of Genomes

msra(2008)

Abstract
It is well known that species phylogenetic analyses based on single genes or parts of genomes are often inconsistent because of factors such as variable rates of evolution and horizontal gene transfer. The availability of more and more sequenced genomes allows phylogeny construction from complete genomes, which is less sensitive to such inconsistency. For such long sequences, construction methods like maximum parsimony (2) and maximum likelihood (4) are often not feasible because of their intensive computational requirements. Other methods, such as neighbour joining (6), require a measure of distance between any two genomes. We present in this study a measure of genetic distance between genomes based on information theory (7, 9). This method uses the expert model (3), a biologically oriented compression algorithm, to estimate the information content of sequences. We demonstrate that our distance measure can be applied to build the phylogenetic tree of a number of Plasmodium parasites from their genomes.

2. Method and Results

Similar to (1), our work is based on the premise that if two sequences are related, one sequence must tell something useful about the other, and that the information content of a sequence can be measured by lossless compression. The information content $I_X$ of sequence X can be approximated by the length of the encoded message obtained by compressing X using the expert model. If a sequence Y related to X is available, the expert model can compress X using Y as background knowledge, which gives a measure of the conditional information content of X given Y, $I_{X|Y}$. The more closely Y is related to X, the more information is saved when compressing X. In other words, the difference between the information content of X and the conditional information content of X given Y is a measure of the similarity of the two sequences. This quantity is defined as the shared information: $I_{X,Y} = I_X - I_{X|Y}$.
In theory, $I_{X,Y}$ should be equal to $I_{Y,X}$, as both represent the shared information of the two sequences. However, this is not always
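
The shared-information definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the expert model is not available here, so the general-purpose `zlib` (DEFLATE) compressor stands in for it, and conditional compression is approximated by passing Y as a preset dictionary. The function names, the `random_dna` and `mutate` helpers, and the toy sequences are all hypothetical. DEFLATE only looks back 32 KB, so the sketch is meaningful for short sequences, not whole genomes.

```python
import random
import zlib


def info_content(x: bytes) -> int:
    """Approximate I_X: bits needed to compress x on its own (zlib stand-in)."""
    return 8 * len(zlib.compress(x, 9))


def conditional_info_content(x: bytes, y: bytes) -> int:
    """Approximate I_{X|Y}: bits needed to compress x when y is already known.

    Assumption: y is supplied as a zlib preset dictionary, which adds a few
    bytes of header overhead, so unrelated inputs can score slightly worse
    than compressing x alone.
    """
    comp = zlib.compressobj(level=9, zdict=y)
    return 8 * len(comp.compress(x) + comp.flush())


def shared_information(x: bytes, y: bytes) -> int:
    """Shared information as defined in the text: I_X minus I_{X|Y}."""
    return info_content(x) - conditional_info_content(x, y)


def random_dna(n: int, rng: random.Random) -> bytes:
    """Hypothetical helper: a random sequence of n bases."""
    return "".join(rng.choices("ACGT", k=n)).encode()


def mutate(seq: bytes, rate: float, rng: random.Random) -> bytes:
    """Hypothetical helper: redraw each base at random with probability `rate`."""
    return bytes(rng.choice(b"ACGT") if rng.random() < rate else b for b in seq)


rng = random.Random(0)
x = random_dna(2000, rng)   # toy "genome"
y = mutate(x, 0.02, rng)    # close relative of x
z = random_dna(2000, rng)   # unrelated sequence

# Both directions are computed so they can be compared with each other.
print("x,y:", shared_information(x, y), " y,x:", shared_information(y, x))
print("x,z:", shared_information(x, z))
```

With this stand-in, a related pair should share clearly more information than an unrelated pair, and printing both directions makes it easy to check how close the two estimates are for a given compressor.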
Keywords
horizontal gene transfer, compression algorithm, information content, phylogenetic tree, lossless compression, maximum likelihood, genetic distance, variable rate, maximum parsimony