Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks
BCB, pp. 1-16, 2020.
Abstract:
The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network...
Introduction
- The advent of new protein sequencing technologies has accelerated the rate of protein discovery [1].
- Often referred to as word embeddings, these vector representations are typically “pre-trained” on an auxiliary task for which a large amount of training data is available.
- The goal of this pre-training is to learn generically useful representations that encode deep semantic and syntactic information [12].
- These “smart” representations can then be used to train systems for NLP tasks for which only a moderate amount of training data is available.
Highlights
- The advent of new protein sequencing technologies has accelerated the rate of protein discovery [1]
- We show PRoBERTa’s performance when the model is fine-tuned for the Protein Family Classification and Protein-Protein Interaction (PPI) Prediction tasks
- We propose a Transformer based neural network architecture, called PRoBERTa, for protein characterization tasks
- We used embeddings from PRoBERTa for a fundamentally different problem, PPI Prediction, using two different datasets generated from the HIPPIE database and found that with sufficient data, it substantially outperforms the current state-of-the-art method in the conservative scenario and still performs better than the other methods in the aggressive scenario
- This, combined with the larger decrease in Normalized Mutual Information (NMI) with protein families in the aggressive scenario (Figure 4), suggests that in the conservative scenario the model performs something closer to a protein classification task: identifying which proteins in the HIPPIE dataset are more likely to appear in positive interaction examples
- PRoBERTa’s success in these two different protein prediction tasks alludes to the generality of the embeddings and their potential to be used in other tasks such as predicting protein binding affinity, protein interaction types and identifying proteins associated with particular diseases
Methods
- The authors treat proteins as a “language” and draw ideas from state-of-the-art techniques in natural language processing to obtain vector representations for proteins.
- For a sequence of amino acids to be treated as a sentence, the alphabet of the language is defined to be the set of amino acid symbols.
- Before amino acid sequences can be interpreted as a language, the authors must first define what a word is.
- There has been recent interest [49] in statistically determining segments of amino acids to be used as inputs for downstream machine learning algorithms using an NLP method called byte pair encoding (BPE) [50].
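To make this concrete, below is a minimal, self-contained sketch of how BPE could segment amino acid sequences into "words" by repeatedly merging the most frequent adjacent pair of symbols. The toy sequences and the merge loop are illustrative only; the paper relies on established subword tokenization tools rather than this implementation.

```python
# Minimal BPE sketch for amino acid sequences (illustrative only).
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Learn the most frequent symbol-pair merges from amino acid sequences."""
    # Start with each sequence split into single amino acid symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

# Hypothetical example: two short amino acid sequences.
merges, tokenized = learn_bpe_merges(["MKTAYIAKQR", "MKTAYIAKQA"], num_merges=3)
print(merges)     # learned merges, e.g. [('M', 'K'), ('MK', 'T'), ...]
print(tokenized)  # sequences re-segmented into learned "words"
```

The learned merges define a fixed vocabulary; the same merges are then applied greedily to segment any new sequence into subword "words".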
Results
- The authors first describe the sequence features learned from the pre-trained model, before the fine-tuning stage.
- The authors show PRoBERTa’s performance when the model is fine-tuned for the Protein Family Classification and PPI Prediction tasks.
- 3.1 Protein Embeddings from the Pre-Trained Model.
- The authors pre-trained the PRoBERTa model as described in Section 2.3 on 4 NVIDIA V100 GPUs in 18 hours.
- The authors first asked whether the pre-trained model contained any biological meaning in the amino acid sequences.
- The pre-trained model is already able to distinguish between these protein families
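As a rough illustration of how such a check can be run, the sketch below clusters fixed-length per-protein embedding vectors and compares the clusters to known family labels using normalized mutual information (NMI), the measure referenced in Figure 4. The embeddings, labels, and dimensionality here are random placeholders, not the paper's data or evaluation code.

```python
# Cluster protein embeddings and score agreement with family labels via NMI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))        # one vector per protein (placeholder)
family_labels = rng.integers(0, 10, size=1000)   # known family ids (placeholder)

clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(family_labels, clusters))
```

A higher NMI indicates that the embedding space groups proteins from the same family together, even before any fine-tuning.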
Conclusion
- The authors propose a Transformer based neural network architecture, called PRoBERTa, for protein characterization tasks.
- The authors used embeddings from PRoBERTa for a fundamentally different problem, PPI Prediction, using two different datasets generated from the HIPPIE database and found that with sufficient data, it substantially outperforms the current state-of-the-art method in the conservative scenario and still performs better than the other methods in the aggressive scenario.
- This, combined with the larger decrease in NMI with protein families in the aggressive scenario (Figure 4), suggests that in the conservative scenario the model performs something closer to a protein classification task: identifying which proteins in the HIPPIE dataset are more likely to appear in positive interaction examples.
- In light of the COVID-19 pandemic, the authors are currently working on adapting PRoBERTa for vaccine design.
Summary
Objectives:
In the pre-training stage, the objective is to train the model to learn task-agnostic deep representations that capture high-level structure of amino acid sequences.
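As a rough illustration of this objective, the sketch below masks a fraction of the BPE tokens in a sequence; the model is then trained to recover the original tokens from the surrounding context. This assumes a BERT/RoBERTa-style masked-token objective and hypothetical tokens; it is not the authors' training code (which builds on fairseq).

```python
# Illustrative masked-token corruption for pre-training (not the paper's code).
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace a random fraction of tokens with a mask symbol.
    Returns the corrupted sequence and the positions the model must recover."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok   # the model is trained to predict this token
        else:
            corrupted.append(tok)
    return corrupted, targets

# Hypothetical BPE tokens from an amino acid sequence.
tokens = ["MKT", "AYI", "AKQ", "RQI", "SFV", "KSH"]
print(mask_tokens(tokens, mask_prob=0.3, seed=1))
```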
Tables
- Table1: Comparison of binary family classification
- Table2: Comparison of multi-class family classification
- Table3: PPI prediction results using 20% of training data (top) and using 100% of training data (bottom)
Funding
- This work has been supported by the National Science Foundation (awards #1750981 and #1725729)
- This work has also been partially supported by the Google Cloud Platform research credits program (to AR, MH, and AN)
Study subjects and analysis
proteins: 50
Clustering the vectors fine-tuned on the protein family classification task increases the NMI even more than the pre-trained model (Figure 4), suggesting that the fine-tuned embeddings have more specific information related to protein classification. In the binary classification task, we trained a separate logistic regression classifier for each protein family with greater than 50 proteins and measured the weighted mean accuracy as 0.98 with the lowest scoring family being made up of 57 proteins and having an accuracy of 0.77. In performing this, we randomly withheld 30% of the proteins from each family to be used as the test set
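A minimal sketch of this per-family binary setup, assuming each protein is already represented by a fixed-length embedding vector; the data, dimensionality, and helper name below are placeholders rather than the authors' exact pipeline.

```python
# One-vs-rest logistic regression for a single protein family, 30% held out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def family_classifier_accuracy(embeddings, in_family):
    """Train a binary classifier for one family and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, in_family, test_size=0.3, random_state=0, stratify=in_family)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Placeholder data: random embeddings and membership flags.
rng = np.random.default_rng(0)
print(family_classifier_accuracy(rng.normal(size=(500, 768)),
                                 rng.integers(0, 2, size=500)))
```

In the paper this is repeated for every family with more than 50 proteins and summarized as a weighted mean accuracy.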
UniProt proteins with only one associated family: 313214
In the multi-class family classification task, we used fine-tuning to add an output layer that maps to protein family labels. This was done using the dataset of 313,214 UniProt proteins with only one associated family. These proteins were split into train/validation/test sets (0.8/0.1/0.1) and provided us with a classifier that had an accuracy of 0.92 on the test set
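A minimal sketch of the 0.8/0.1/0.1 split described above, using scikit-learn; the random seed and whether the split was stratified by family are assumptions of this example.

```python
# Split proteins and family labels into 80% train, 10% validation, 10% test.
from sklearn.model_selection import train_test_split

def split_80_10_10(sequences, labels, seed=0):
    # Peel off 10% for the test set, then 1/9 of the remainder (= 10% overall)
    # for the validation set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        sequences, labels, test_size=0.10, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=1/9, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```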
proteins: 250504
The PPI models appear to be more robust (they have smaller slopes) than the Protein Family model. However, it should be noted that the complete training set for the Protein Family model contained 250,504 proteins, while the PPI model had 480,455 interactions in the conservative scenario and 429,239 interactions in the aggressive scenario. This difference in robustness could be due to the absolute difference in the number of training data points
References
- Laura Restrepo-Pérez, Chirlmin Joo, and Cees Dekker. Paving the way to single-molecule protein sequencing. Nature nanotechnology, 13(9):786–796, 2018.
- The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research, 47(D1):D506– D515, 11 2018.
- Minsik Oh, Seokjun Seo, Sun Kim, and Youngjune Park. DeepFam: deep learning based alignment-free method for protein family modeling and prediction. Bioinformatics, 34(13):i254–i262, 06 2018.
- Muhao Chen, Chelsea J T Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics, 35(14):i305–i314, 07 2019.
- Temple F Smith, Michael S Waterman, et al. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.
- Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. Deep learning for computational biology. Molecular systems biology, 12(7), 2016.
- Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, Wei Xie, Gail L. Rosen, Benjamin J. Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E. Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M. Cofer, Christopher A. Lavender, Srinivas C. Turaga, Amr M. Alexandari, Zhiyong Lu, David J. Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K. Wiley, Marwin H. S. Segler, Simina M. Boca, S. Joshua Swamidass, Austin Huang, Anthony Gitter, and Casey S. Greene. Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society, Interface, 15(141):20170387, Apr 2018. 29618526[pmid].
- Jianzhu Ma, Michael Ku Yu, Samson Fong, Keiichiro Ono, Eric Sage, Barry Demchak, Roded Sharan, and Trey Ideker. Using deep learning to model the hierarchical structure and function of a cell. Nature Methods, 15(4):290–298, 2018.
- Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, and Mark A. DePristo. A universal snp and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10):983–987, 2018.
- Alex Zhavoronkov, Yan A. Ivanenkov, Alex Aliper, Mark S. Veselov, Vladimir A. Aladinskiy, Anastasiya V. Aladinskaya, Victor A. Terentiev, Daniil A. Polykovskiy, Maksim D. Kuznetsov, Arip Asadulaev, Yury Volkov, Artem Zholus, Rim R. Shayakhmetov, Alexander Zhebrak, Lidiya I. Minaeva, Bogdan A. Zagribelnyy, Lennart H. Lee, Richard Soll, David Madge, Li Xing, Tao Guo, and Alán Aspuru-Guzik. Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature Biotechnology, 37(9):1038–1040, 2019.
- Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
- Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. CoRR, abs/1708.02709, 2017.
- Mark A. Bedau, Nicholas Gigliotti, Tobias Janssen, Alec Kosik, Ananthan Nambiar, and Norman Packard. Open-ended technological innovation. Artificial Life, 25(1):33–49, 2019. PMID: 30933632.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
- Ehsaneddin Asgari and Mohammad R. K. Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLOS ONE, 10(11):1–15, 11 2015.
- Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nachaev, Florian Matthes, and Burkhard Rost. Modeling the language of life – deep learning protein sequences. bioRxiv, 2019.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2020.
- Natalie L. Dawson, Ian Sillitoe, Jonathan G. Lees, Su Datt Lam, and Christine A. Orengo. CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences, pages 79–110. Springer New York, New York, NY, 2017.
- Julian Gough, Kevin Karplus, Richard Hughey, and Cyrus Chothia. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 313(4):903–919, 2001.
- Huaiyu Mi, Sagar Poudel, Anushya Muruganujan, John T. Casagrande, and Paul D. Thomas. PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Research, 44(D1):D336–D342, 11 2015.
- Marco Punta, Penny C. Coggill, Ruth Y. Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, Alex Bateman, and Robert D. Finn. The Pfam protein families database. Nucleic Acids Research, 40(D1):D290–D301, 11 2011.
- Sayoni Das and Christine A. Orengo. Protein function annotation using protein domain family resources. Methods, 93:24 – 34, 2016. Computational protein function predictions.
- Maxwell L. Bileschi, David Belanger, Drew Bryant, Theo Sanderson, Brandon Carter, D. Sculley, Mark A. DePristo, and Lucy J. Colwell. Using deep learning to annotate the protein universe. bioRxiv, 2019.
- Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek. UDSMProt: universal deep sequence models for protein classification. Bioinformatics, 01 2020. btaa003.
- Javier De Las Rivas and Celia Fontanillo. Protein-protein interactions essentials: Key concepts to building and analyzing interactome networks. PLOS Computational Biology, 6(6):1–8, 06 2010.
- Tuba Sevimoglu and Kazim Yalcin Arga. The role of protein interaction networks in systems biomedicine. Computational and Structural Biotechnology Journal, 11(18):22 – 27, 2014.
- Uros Kuzmanov and Andrew Emili. Protein-protein interaction networks: probing disease mechanisms using model systems. Genome Medicine, 5(4), Apr 2013.
- Ioanna Petta, Sam Lievens, Claude Libert, Jan Tavernier, and Karolien De Bosscher. Modulation of protein–protein interactions for the development of novel therapeutics. Molecular Therapy, 24(4):707–718, Apr 2016.
- Diego Alonso-López, Francisco J Campos-Laborie, Miguel A Gutiérrez, Luke Lambourne, Michael A Calderwood, Marc Vidal, and Javier De Las Rivas. APID database: redefining protein-protein interaction experimental evidences and binary interactomes. Database, 2019, 01 2019.
- Alberto Calderone, Luisa Castagnoli, and Gianni Cesareni. mentha: a resource for browsing integrated protein-interaction networks. Nature Methods, 10(8):690–691, 2013.
- Henning Hermjakob, Luisa Montecchi-Palazzi, Chris Lewington, Sugath Mudali, Samuel Kerrien, Sandra Orchard, Martin Vingron, Bernd Roechert, Peter Roepstorff, Alfonso Valencia, Hanah Margalit, John Armstrong, Amos Bairoch, Gianni Cesareni, David Sherman, and Rolf Apweiler. Intact: an open source molecular interaction database. Nucleic acids research, 32(Database issue):D452–D455, Jan 2004.
- Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio Pio Nardozza, Elena Santonico, Luisa Castagnoli, and Gianni Cesareni. Mint, the molecular interaction database: 2012 update. Nucleic acids research, 40(Database issue):D857–D861, Jan 2012.
- Ulrich Stelzl, Uwe Worm, Maciej Lalowski, Christian Haenig, Felix H. Brembeck, Heike Goehler, Martin Stroedicke, Martina Zenkner, Anke Schoenherr, Susanne Koeppen, Jan Timm, Sascha Mintzlaff, Claudia Abraham, Nicole Bock, Silvia Kietzmann, Astrid Goedde, Engin Toksöz, Anja Droege, Sylvia Krobitsch, Bernhard Korn, Walter Birchmeier, Hans Lehrach, and Erich E. Wanker. A human protein-protein interaction network: A resource for annotating the proteome. Cell, 122(6):957–968, Sep 2005.
- Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, Lars J Jensen, and Christian von Mering. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 11 2018.
- Yanzhi Guo, Lezheng Yu, Zhining Wen, and Menglong Li. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research, 36(9):3025– 3030, 04 2008.
- Xue-Wen Chen and Mei Liu. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, 21(24):4394–4400, 10 2005.
- Shao-Wu Zhang, Li-Yang Hao, and Ting-He Zhang. Prediction of protein-protein interaction with pairwise kernel support vector machine. International Journal of Molecular Sciences, 15(2):3220–3233, Feb 2014.
- Yi Guo and Xiang Chen. A deep learning framework for improving protein interaction prediction using sequence properties. bioRxiv, 2019.
- Tanlin Sun, Bo Zhou, Luhua Lai, and Jianfeng Pei. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics, 18(1):277, 2017.
- Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 2019.
- Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pretraining time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.
- Nomenclature and symbolism for amino acids and peptides. European Journal of Biochemistry, 138(1):9–37, 1984.
- Somaye Hashemifar, Behnam Neyshabur, Aly A Khan, and Jinbo Xu. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics, 34(17):i802–i810, 09 2018.
- Ehsaneddin Asgari, Alice C. McHardy, and Mohammad R. K. Mofrad. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (dimotif) and sequence embedding (protvecx). Scientific Reports, 9(1):3577, 2019.
- Philip Gage. A new algorithm for data compression. C Users J., 12(2):23–38, February 1994.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics.
- Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, November 2018.
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.
- Gregorio Alanis-Lobato, Miguel A. Andrade-Navarro, and Martin H. Schaefer. HIPPIE v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Research, 45(D1):D408–D414, 10 2016.
- Tobias Hamp and Burkhard Rost. Evolutionary profiles improve protein-protein interaction prediction from sequence. Bioinformatics, 31(12):1945–1950, 02 2015.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory, ICDT ’99, pages 217–235, Berlin, Heidelberg, 1999. Springer-Verlag.
- Ananthan Nambiar, Mark Hopkins, and Anna Ritz. Computing the language of life: NLP approaches to feature extraction for protein classification. In ISMB/ECCB 2019: Poster Session, 2019.