Log-normalizing to read depth outperforms compositional data transformations in machine learning applications

Aaron Yerke, Daisy Brumit, Anthony Fodor

Research Square (2023)

Abstract

Background: Normalization, as a pre-processing step, can significantly affect the resolution of machine learning analysis in microbiome studies, and there are countless normalization schemes to choose from. In this study, we examined compositionally aware algorithms, including the additive log ratio (alr), the centered log ratio (clr), and a recent evolution of the isometric log ratio (ilr) in the form of balance trees built with the PhILR R package. We also examined compositionally naïve transformations, such as raw counts tables and a transformation that log-normalizes samples to the average read depth (which we call "lognorm").

Results: In our evaluation, we used 62 metadata variables culled from four publicly available datasets at the Amplicon Sequence Variant (ASV) level with a random forest machine learning algorithm, which we demonstrate is reliably among the most effective machine learning classification algorithms. We found that different common pre-processing steps in the creation of the balance trees made very little difference in overall performance. Overall, the compositionally aware transformations such as alr, clr, and ilr (PhILR) generally performed slightly worse than, or only as well as, the compositionally naïve transformations. However, the lognorm transformation outperformed all other transformations by a small but reliably statistically significant margin.

Conclusions: Our results suggest that minimizing the complexity of transformations while correcting for read depth may be a generally preferable strategy for preparing data for machine learning, compared to more sophisticated, but more complex, transformations that attempt to better correct for compositionality.
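To make the comparison concrete, the sketch below gives plausible NumPy versions of the three simpler transformations discussed above. The exact lognorm formula is an assumption inferred from the abstract's description (rescale each sample to the average read depth, then take log10 with a pseudocount); clr and alr follow their standard Aitchison definitions. The pseudocount handling and function names are illustrative rather than the authors' exact pipeline, and the ilr/PhILR balance trees are omitted since they additionally require a phylogenetic tree.

```python
import numpy as np

def lognorm(counts, pseudocount=1.0):
    # Assumed form of the paper's "lognorm": scale each sample's counts
    # to the average read depth, add a pseudocount, and take log10.
    # counts: (n_samples, n_taxa) array of raw ASV counts.
    depths = counts.sum(axis=1, keepdims=True)  # reads per sample
    mean_depth = depths.mean()                  # average read depth
    return np.log10(counts / depths * mean_depth + pseudocount)

def clr(counts, pseudocount=1.0):
    # Centered log ratio: log of each component relative to the
    # sample's geometric mean; the pseudocount avoids log(0).
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)

def alr(counts, ref=-1, pseudocount=1.0):
    # Additive log ratio: log of each component relative to one
    # reference taxon (here the last column by default), which is
    # then dropped from the output.
    logx = np.log(counts + pseudocount)
    return np.delete(logx - logx[:, [ref]], ref, axis=1)

# Toy usage: two samples with very different read depths.
counts = np.array([[10, 0, 5],
                   [400, 80, 320]])
print(lognorm(counts))  # depth-corrected log counts
print(clr(counts))      # each row sums to ~0 by construction
print(alr(counts))      # one fewer column than the input
```

Zero handling (here a simple pseudocount) is itself a pre-processing choice and varies across implementations of these transformations.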
Keywords
compositional data transformations, depth, machine learning, log-normalizing