Language models of protein sequences at the scale of evolution enable accurate structure prediction

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives

(2022)

Abstract

Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher-level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high-accuracy, end-to-end, atomic-level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.
