FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
CoRR (2024)
Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in
diverse environments like oceans and soils, significantly impacting human
health and ecological functions. However, current research relies on K-mer
representations, limiting the capture of structurally relevant gene contexts.
To address these limitations and further our understanding of complex
relationships between metagenomic sequences and their functions, we introduce a
protein-based gene representation as a context-aware and structure-relevant
tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene
group-level pre-training, providing insights into inter-gene contextual
information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for
gene-level pre-training to model gene sequence-function relationships. MGM and
TEM-CL constitute our novel metagenomic language model, FGBERT, pre-trained on
100 million metagenomic sequences. We demonstrate the superiority of our
proposed FGBERT on eight datasets.