Illuminating the functional landscape of the dark proteome across the Animal Tree of Life through natural language processing models

Gemma I. Martinez-Redondo, Israel Barrios-Nunez, Marcal Vazquez-Valls, Ana M. Rojas, Rosa Fernandez

bioRxiv (2024)

Abstract
Understanding how coding genes and their functions evolve over time is a key aspect of evolutionary biology. Protein-coding genes that are poorly understood or uncharacterized at the functional level may be related to important evolutionary innovations; leaving them unannotated can lead to incomplete or inaccurate models of evolutionary change and limits the ability to identify conserved or lineage-specific features. Homology-based methodologies often fail to transfer functional annotations to a large fraction of the coding gene repertoire in non-model organisms. This is particularly relevant in animals, where a large number of coding genes yield no functional annotation. Here, we leverage machine learning and natural language processing models to investigate functional annotation in the dark proteome (defined as the unknown functional landscape) of ca. 1,000 gene repertoires from virtually all animal phyla, totaling ca. 23.2 million coding genes. Gene Ontology annotations were transferred to virtually all genes, with the model ProtT5 outperforming both homology-based and other machine learning-based models. We then explored the dark proteome of all animal phyla, revealing an enrichment in functions related to immune response, viral infection, response to stimuli, development, and signaling, among others. Furthermore, we provide an open-access pipeline, FANTASIA, to implement and benchmark these methodologies on any dataset. Our results uncover the putative functions of poorly understood protein-coding genes across the Animal Tree of Life and contribute to a more comprehensive understanding of the molecular basis of animal evolution.

Competing Interest Statement: The authors have declared no competing interest.