MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping

bioRxiv (2021)

Abstract
Phylogenetic tree confidence is often estimated from a multiple sequence alignment (MSA) using the Felsenstein bootstrap heuristic. However, this does not account for systematic errors in the MSA, which may substantially bias the inferred phylogeny. Here, I describe the MSA ensemble bootstrap, a new procedure that generates a set of replicate MSAs by varying parameters such as gap penalties and substitution scores. Such an ensemble is called diagnostic if the typical distance between MSAs is comparable to the error rate. Confidence in a prediction derived from an MSA, e.g. a monophyletic clade, is expressed as the fraction of the ensemble in which the prediction is reproduced. This approach is implemented in MUSCLE by modifying the Probcons algorithm, which is based on a hidden Markov model (HMM). An ensemble is generated by perturbing HMM parameters and permuting the guide tree. Ensembles generated by this method are shown to be diagnostic on the Balibase benchmark. To enable scaling to large datasets, divide-and-conquer heuristics are introduced. A new benchmark (Balifam) is described with 36 sets of 10000+ proteins. On Balifam, ensembles generated by MUSCLE are shown to align an average of 59% of columns correctly, a relative improvement of 13% over Clustal-omega (52% correct) and 26% over MAFFT (47% correct). The ensemble bootstrap is applied to a previously published tree of RNA viruses, showing that the high reported Felsenstein bootstrap confidence of Ribovirus phylum branching order is an artifact of systematic MSA errors.

Data availability: MUSCLE source code; Balifam benchmark; Qscore source code; Palmscan source code.

Competing Interest Statement: The authors have declared no competing interest.
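The ensemble confidence measure described in the abstract (the fraction of replicates in which a prediction is reproduced) can be illustrated with a minimal Python sketch. This is not MUSCLE's implementation. It assumes each replicate MSA has already been turned into a tree and that each tree is summarized as a set of clades, where a clade is a set of leaf labels. The function name `ensemble_confidence` and the toy leaf labels are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

# A clade is represented by the set of leaf labels it contains (an assumption
# of this sketch; the paper does not prescribe a data structure).
Clade = FrozenSet[str]


def ensemble_confidence(clade: Iterable[str],
                        replicate_clades: List[Set[Clade]]) -> float:
    """Return the fraction of replicate trees that contain `clade`
    as a monophyletic group."""
    if not replicate_clades:
        raise ValueError("ensemble is empty")
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return hits / len(replicate_clades)


# Toy ensemble: four replicate trees, each summarized by its clades.
ensemble = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"B", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
]

# The clade {A, B} appears in 3 of the 4 replicates.
conf_ab = ensemble_confidence({"A", "B"}, ensemble)
print(conf_ab)
```

In this toy ensemble the clade {A, B} is recovered in three of four replicates, so its ensemble confidence is 0.75. In practice the replicates would come from alignments built with perturbed HMM parameters and permuted guide trees, as the abstract describes.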