We present PSSM-Distil, a PSSM enhancement method with knowledge distillation and contrastive learning to tackle the problem of protein secondary structure prediction (PSSP) on low-quality PSSMs.
PSSM-Distil: Protein Secondary Structure Prediction (PSSP) on Low-Quality PSSM by Knowledge Distillation with Contrastive Learning
Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021): 617–625
Protein secondary structure prediction (PSSP) is an essential task in computational biology. To achieve accurate PSSP, the general and vital feature engineering step is to use multiple sequence alignment (MSA) for Position-Specific Scoring Matrix (PSSM) extraction. However, when only a low-quality PSSM can be obtained due to poor sequence homology…
- Protein structure analysis, especially protein tertiary (3D) structure, plays a critical role in practical protein applications, such as the understanding of protein functions and the design of drugs (Noble, Endicott, and Johnson 2004).
- There are three mainstream methods for protein tertiary (3D) structure prediction: X-ray crystallography and nuclear magnetic resonance (NMR) (Wuthrich 1989), cryo-EM-based methods (Wang et al. 2015), and computer-aided ab initio prediction (Mandell and Kortemme 2009).
- Given the extremely time-consuming drawback of X-ray crystallography, the sequence-length limitation of NMR, and the expensive equipment required for cryo-EM, computer-aided ab initio protein structure prediction is an attractive alternative.
- To give a more detailed comparison, we split the protein sequences with low-quality Position-Specific Scoring Matrices (PSSMs) into several divisions of MSA count and MSA Meff, following Guo et al. (2020).
- As shown in Table 2 and Table 3, our approach achieves the best performance on protein sequences with low-quality PSSMs, regardless of whether they are grouped by low MSA count or by low MSA Meff score.
- We present PSSM-Distil, a PSSM enhancement method with knowledge distillation and contrastive learning to tackle the problem of protein secondary structure prediction (PSSP) on low-quality PSSMs.
- We jointly train the EnhanceNet and a student network for PSSM enhancement and PSSP using low-quality PSSMs that are down-sampled from high-quality PSSMs.
- We remove the proteins that share more than 25% sequence identity with our CullPDB dataset
- Beyond the sophisticated architecture design, a novel loss function is elaborated that combines knowledge distillation, contrastive loss and mean square error (MSE) loss to jointly optimize the EnhanceNet and the student network for the generation of high-quality PSSMs and accurate PSSP.
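The combined objective described above (a distillation term, a contrastive term, and an MSE term) can be sketched in plain NumPy. This is a minimal illustration, not the paper's exact formulation: the function names, the temperature value and the unit weights are assumptions, and the contrastive term is written as a generic triplet loss.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Soft-label knowledge distillation: KL(teacher || student) at temperature T.
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float(np.mean(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)))

def mse_loss(x_enhanced, x_high):
    # Pixel-wise MSE between the enhanced PSSM Xe and the high-quality PSSM Xh.
    return float(np.mean((x_enhanced - x_high) ** 2))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull the enhanced-PSSM embedding toward the high-quality embedding,
    # push it away from the low-quality one.
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def total_loss(student_logits, teacher_logits, x_e, x_h, emb_a, emb_p, emb_n,
               w_kd=1.0, w_mse=1.0, w_tri=1.0):
    # Illustrative joint objective; the real loss weights are hyperparameters.
    return (w_kd * kd_loss(student_logits, teacher_logits)
            + w_mse * mse_loss(x_e, x_h)
            + w_tri * triplet_loss(emb_a, emb_p, emb_n))
```

In practice each term would be an autograd-capable tensor operation so the EnhanceNet and the student network receive gradients from all three terms at once.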
- There are 20 common amino acids that function as the building blocks of a protein sequence.
- The PSSM indicates the substitution log-likelihood of all the 20 aminoacid types at each position, based on homologous sequences.
- The PSSM of a protein sequence, denoted X, is defined as X_{k,j} = log(P_{k,j} / B_k), where P is the position probability matrix, B is the background frequency matrix, k ranges over the 20 amino-acid types, and j ∈ {1, ..., L} with L denoting the length of the protein sequence.
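Under this definition, a toy PSSM can be computed directly from a handful of aligned sequences. The sketch below uses uniform background frequencies and simple pseudocount smoothing, which are simplifying assumptions; real pipelines (e.g. PSI-BLAST) use substitution-matrix-weighted counts and empirical backgrounds.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid types

def pssm_from_msa(msa, background=None, pseudocount=1.0):
    """Toy PSSM: X[k][j] = log(P[k][j] / B[k]) from aligned sequences.

    P is the pseudocount-smoothed position probability matrix and B the
    background frequency of each amino-acid type. The result is returned
    position-first (pssm[j][aa]) for convenience, i.e. the transpose of
    the X_{k,j} layout in the text.
    """
    if background is None:
        # Uniform background: an assumption for illustration only.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(msa[0])
    n = len(msa)
    pssm = []
    for j in range(length):
        column = Counter(seq[j] for seq in msa)
        row = {}
        for aa in AMINO_ACIDS:
            p = (column.get(aa, 0) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            row[aa] = math.log(p / background[aa])
        pssm.append(row)
    return pssm
```

A column that is fully conserved for one residue gets a strongly positive score for that residue and negative scores elsewhere, which is exactly the substitution log-likelihood behavior the definition describes.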
- The authors evaluate the PSSM-Distil framework on low-quality PSSM protein sequences from three public datasets: CullPDB, CB513 and BC40.
- The authors explicitly utilize the BERT pseudo PSSM to enhance extreme low-quality cases, i.e., protein sequences with no homology at all.
- Table 1: Details of the datasets used in our experiments, including dataset names, types and number of protein sequences
- Table 2: PSSP results on the BC40, CullPDB and CB513 test sets for protein sequences with low-quality PSSMs, grouped by MSA count score. "MSA Counts" stands for the number of aligned sequences in the MSA of a protein sequence. The "Number" column gives the number of protein sequences in the datasets whose searched MSAs fall into that MSA Counts category. The "Real" column is the baseline result without any enhancement technique. The "Bagging" column is the result of a previous data enhancement method. Our experimental results show a large improvement over both the baseline and "Bagging"
- Table 3: PSSP results on BC40 and CullPDB for protein sequences with low-quality PSSMs, grouped by Meff score
- Table 4: Ablation study results of our method on the BC40 dataset. The "Our" column is the result of the full method. "w/o BERT" is the result without the BERT pseudo PSSM Xb. "w/o CL" is the result without the triplet loss Lt from contrastive learning. "w/o MSE" is the result without the MSE loss Lm between Xe and Xh. The clear degradation from ablating each component implies its important role in our method
- Multiple Sequence Alignment (MSA). MSA is a sequence alignment of multiple homologous protein sequences for a target protein (Wang and Jiang 1994). It is a key technique for modeling sequence relationships in computational biology. Given a protein database and a protein sequence, an MSA is searched by performing pairwise comparisons (Altschul et al. 1990), Hidden Markov Model-like probabilistic models (Eddy 1998; Johnson, Eddy, and Portugaly 2010; Remmert et al. 2012), or a combination of both (Altschul et al. 1997) to align the sequence against the given database. Once the MSA is built, it is usually converted to a Position-Specific Scoring Matrix (PSSM) for subsequent tasks.
Low-quality PSSM Enhancement. Since MSA and PSSM are critical for protein property prediction, "Bagging" (Guo et al. 2020) is the first attempt to enhance low-quality PSSMs. By minimizing the MSE loss between the reconstructed and original PSSM, "Bagging" reconstructs high-quality MSA from down-sampled MSA with low-quality PSSM via an unsupervised method. Even though "Bagging" is the first work to achieve a relatively satisfactory performance, it still has some limitations. First, it exploits a fixed ratio for MSA down-sampling to obtain the low-quality PSSM, which makes the model less robust, especially for sequences with extremely low homology. Second, "Bagging" only conducts PSSM enhancement while ignoring the joint optimization of PSSM and the final PSSP.
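The fixed-ratio down-sampling criticized above can be contrasted with drawing a random keep-ratio per training sample, which exposes the model to a range of MSA qualities down to near-singleton alignments. The sketch below illustrates the idea; the function and parameter names (`ratio`, `min_keep`, `seed`) are illustrative, not taken from either paper.

```python
import random

def downsample_msa(msa, ratio=None, min_keep=1, seed=None):
    """Simulate a low-homology search result by subsampling MSA rows.

    If `ratio` is None, a keep-ratio is drawn uniformly at random, so each
    training sample sees a different simulated PSSM quality instead of the
    single fixed ratio used by "Bagging".
    """
    rng = random.Random(seed)
    if ratio is None:
        ratio = rng.uniform(0.0, 1.0)  # vary the simulated quality per sample
    keep = max(min_keep, int(round(ratio * len(msa))))
    query = msa[0]  # always keep the query sequence itself
    rest = rng.sample(msa[1:], min(keep - 1, len(msa) - 1)) if keep > 1 else []
    return [query] + rest
```

Each subsampled MSA would then be converted to a low-quality PSSM Xl and paired with the full MSA's high-quality PSSM Xh as a training example for the enhancement network.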
- The work was supported in part by the Key Area R&D Program of Guangdong Province under grant No. 2018B030338001, by the National Key R&D Program of China under grant No. 2018YFB1800800, by NSFC-Youth 61902335, by Guangdong Regional Joint Fund-Key Projects 2019B1515120039, by the Shenzhen Outstanding Talents Training Fund, by Guangdong Research Project No. 2017ZT07X152 and by the CCF-Tencent Open Fund
Benefiting from the low-quality PSSMs Xl obtained through domain-aligned down-sampling, our EnhanceNet outputs a more realistic high-quality Xe, leading to superior and more robust performance than previous methods. We evaluate our PSSM-Distil framework on low-quality PSSM protein sequences from three public datasets: CullPDB, CB513 and BC40. Comparison experiments with previous state-of-the-art models confirm the superiority of our approach.
- Alley, E. C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; and Church, G. M. 2019. Unified rational protein engineering with sequence-based deep representation learning. Nature methods 16(12): 1315–1322.
- Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; and Lipman, D. J. 1990. Basic local alignment search tool. Journal of molecular biology 215(3): 403–410.
- Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25(17): 3389–3402.
- Bepler, T.; and Berger, B. 2019. Learning protein sequence embeddings using information from structure. arXiv preprint arXiv:1902.08661.
- Bucilua, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 535–541.
- Chen, G.; Choi, W.; Yu, X.; Han, T.; and Chandraker, M. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, 742–751.
- Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Eddy, S. R. 1998. Profile hidden Markov models. Bioinformatics (Oxford, England) 14(9): 755–763.
- Guo, Y.; Wu, J.; Ma, H.; Wang, S.; and Huang, J. 2020. Bagging MSA Learning: Enhancing Low-Quality PSSM with Deep Learning for Accurate Protein Structure Property Prediction. In International Conference on Research in Computational Molecular Biology, 88–103. Springer.
- Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, 1735–1742. IEEE.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
- Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; and Rost, B. 2019. Modeling the Language of Life-Deep Learning Protein Sequences. bioRxiv 614313.
- Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Johnson, L. S.; Eddy, S. R.; and Portugaly, E. 2010. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC bioinformatics 11(1): 431.
- Kabsch, W.; and Sander, C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules 22(12): 2577–2637.
- Kryshtafovych, A.; Barbato, A.; Fidelis, K.; Monastyrskyy, B.; Schwede, T.; and Tramontano, A. 2014. Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins: Structure, Function, and Bioinformatics 82: 112–126.
- Li, Z.; and Yu, Y. 2016. Protein secondary structure prediction using cascaded convolutional and recurrent neural networks. arXiv preprint arXiv:1604.07176.
- Mandell, D. J.; and Kortemme, T. 2009. Computer-aided design of functional protein interactions. Nature chemical biology 5(11): 797–807.
- Mirzadeh, S.-I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; and Ghasemzadeh, H. 2019. Improved Knowledge Distillation via Teacher Assistant. arXiv preprint arXiv:1902.03393.
- Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6707–6717.
- Noble, M. E.; Endicott, J. A.; and Johnson, L. N. 2004. Protein kinase inhibitors: insights into drug design from structure. Science 303(5665): 1800–1805.
- Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
- Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.
- Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; and Song, Y. 2019. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, 9686–9698.
- Remmert, M.; Biegert, A.; Hauser, A.; and Soding, J. 2012. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods 9(2): 173.
- Rives, A.; Goyal, S.; Meier, J.; Guo, D.; Ott, M.; Zitnick, C. L.; Ma, J.; and Fergus, R. 2019. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 622803.
- Schmitt, S.; Hudson, J. J.; Zidek, A.; Osindero, S.; Doersch, C.; Czarnecki, W. M.; Leibo, J. Z.; Kuttler, H.; Zisserman, A.; Simonyan, K.; et al. 2018. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835.
- Sønderby, S. K.; and Winther, O. 2014. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828.
- Steinegger, M.; and Soding, J. 2017. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology 35(11): 1026–1028.
- Suzek, B. E.; Wang, Y.; Huang, H.; McGarvey, P. B.; Wu, C. H.; and Consortium, U. 2015. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31(6): 926–932.
- Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
- Wang, G.; and Dunbrack Jr, R. L. 2003. PISCES: a protein sequence culling server. Bioinformatics 19(12): 1589–1591.
- Wang, L.; and Jiang, T. 1994. On the complexity of multiple sequence alignment. Journal of computational biology 1(4): 337–348.
- Wang, R. Y.-R.; Kudryashev, M.; Li, X.; Egelman, E. H.; Basler, M.; Cheng, Y.; Baker, D.; and DiMaio, F. 2015. De novo protein structure determination from near-atomic-resolution cryo-EM maps. Nature methods 12(4): 335–338.
- Wang, S.; Peng, J.; Ma, J.; and Xu, J. 2016. Protein secondary structure prediction using deep convolutional neural fields. Scientific reports 6(1): 1–11.
- Wuthrich, K. 1989. Protein structure determination in solution by nuclear magnetic resonance spectroscopy. Science 243(4887): 45–50.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, 5754–5764.
- Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4133–4141.
- Yu, R.; Li, A.; Morariu, V. I.; and Davis, L. S. 2017. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE international conference on computer vision, 1974–1982.
- Zhou, J.; and Troyanskaya, O. G. 2014. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. arXiv preprint arXiv:1403.1347.
- Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, 6002–6012.