PSSM-Distil: Protein Secondary Structure Prediction (PSSP) on Low-Quality PSSM by Knowledge Distillation with Contrastive Learning

Thirty-Fifth AAAI Conference on Artificial Intelligence / Thirty-Third Conference on Innovative Applications of Artificial Intelligence (2021): 617–625

Cited by: 0 | Views: 276
EI

Abstract

Protein secondary structure prediction (PSSP) is an essential task in computational biology. To achieve accurate PSSP, the general and vital feature engineering step is to use multiple sequence alignment (MSA) for Position-Specific Scoring Matrix (PSSM) extraction. However, when only a low-quality PSSM can be obtained due to poor sequence homology, …

Introduction
  • Protein structure analysis, especially protein tertiary (3D) structure, plays a critical role in practical protein applications, such as the understanding of protein functions and the design of drugs (Noble, Endicott, and Johnson 2004).
  • There are three mainstream methods for protein tertiary (3D) structure prediction: X-ray crystallography and nuclear magnetic resonance (NMR) (Wuthrich 1989), cryo-EM based methods (Wang et al. 2015), and computer-aided ab initio prediction (Mandell and Kortemme 2009).
  • Given the extremely time-consuming nature of X-ray crystallography, the sequence-length limitation of NMR, and the expensive equipment required for cryo-EM, computer-assisted protein structure prediction has drawn increasing attention.
Highlights
  • Protein structure analysis, especially protein tertiary (3D) structure, plays a critical role in practical protein applications, such as the understanding of protein functions and the design of drugs (Noble, Endicott, and Johnson 2004)
  • To give a more detailed comparison, we split the protein sequences with low-quality Position-Specific Scoring Matrices (PSSMs) into several divisions of multiple sequence alignment (MSA) count and MSA Meff according to Guo et al. (2020)
  • As shown in Table 2 and Table 3, our approach achieves the best performance on protein sequences with low-quality PSSMs under both low MSA count score and low MSA Meff score settings
  • We present PSSM-Distil, a PSSM enhancement method with knowledge distillation and contrastive learning to tackle the problem of protein secondary structure prediction (PSSP) on low-quality PSSMs
  • We jointly train the EnhanceNet and a student network for PSSM enhancement and PSSP by using low-quality PSSMs that are down-sampled from high-quality PSSMs (a sketch of such down-sampling follows this list)
  • We remove the proteins that share more than 25% sequence identity with our CullPDB dataset
  • Beyond the sophisticated architecture design, a novel loss function is elaborated with knowledge distillation, contrastive loss and mean squared error loss to jointly optimize the EnhanceNet and the student network for the generation of high-quality PSSMs and accurate PSSP
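To make the training-pair construction concrete, here is a minimal sketch, assuming that down-sampling amounts to randomly dropping aligned homologs at varied ratios. The function name `downsample_msa` and the ratio range are illustrative assumptions, not the paper's code; the paper's "domain aligned down-sampling" may be more elaborate.

```python
import random

def downsample_msa(msa, keep_ratio=None):
    """Simulate a low-homology protein by randomly dropping aligned
    sequences from a rich MSA. Recomputing the PSSM on the reduced
    MSA yields a synthetic low-quality X_l paired with the original
    high-quality X_h for training (an assumption about the pipeline)."""
    if keep_ratio is None:
        # Varied ratios (an assumed range) avoid the fixed-ratio
        # brittleness attributed to "Bagging" in Related Work.
        keep_ratio = random.uniform(0.01, 0.5)
    k = max(1, int(len(msa) * keep_ratio))
    return random.sample(msa, k)
```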
Methods
  • There are 20 common amino acids that function as the building blocks of a protein sequence.
  • The PSSM indicates the substitution log-likelihood of each of the 20 amino-acid types at every position, based on homologous sequences.
  • The PSSM of a protein sequence, denoted by $X$, is defined as $X_{k,j} = \log(P_{k,j} / B_k)$, where $P$ is the position probability matrix, $B$ is the background frequency matrix, $k$ indexes one of the 20 amino-acid types, and $j \in \{1, \dots, L\}$ with $L$ denoting the length of the protein sequence.
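As a worked illustration of this definition, the following is a minimal sketch (not the authors' code) that computes a PSSM from an aligned MSA; the function name `pssm_from_msa`, the pseudocount, and the uniform background are assumptions added for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid types

def pssm_from_msa(msa, background, pseudocount=1.0):
    """Compute X[k, j] = log(P[k, j] / B[k]) from aligned sequences."""
    L = len(msa[0])
    counts = np.zeros((20, L))
    for seq in msa:
        for j, res in enumerate(seq):
            k = AMINO_ACIDS.find(res)
            if k >= 0:                       # skip gaps ('-') and unknowns
                counts[k, j] += 1
    smoothed = counts + pseudocount          # pseudocount avoids log(0)
    P = smoothed / smoothed.sum(axis=0)      # position probability matrix P
    B = np.array([background[a] for a in AMINO_ACIDS])[:, None]  # B_k
    return np.log(P / B)                     # the PSSM X, shape (20, L)

# Toy usage with a uniform background distribution (an assumption).
msa = ["ACDG", "ACDG", "AC-G"]
X = pssm_from_msa(msa, {a: 0.05 for a in AMINO_ACIDS})
print(X.shape)  # (20, 4)
```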
Results
  • The authors evaluate the PSSM-Distil framework on low-quality PSSM protein sequences from three public datasets: CullPDB, CB513 and BC40.
  • To give a more detailed comparison, the authors split the protein sequences with low-quality PSSMs into several divisions of MSA count and MSA Meff according to Guo et al. (2020).
  • As shown in Table 2 and Table 3, the approach achieves the best performance on protein sequences with low-quality PSSMs under both low MSA count score and low MSA Meff score settings.
Conclusion
  • The authors present PSSM-Distil, a PSSM enhancement method with knowledge distillation and contrastive learning to tackle the problem of PSSP on low-quality PSSMs.
  • The authors jointly train the EnhanceNet and a student network for PSSM enhancement and PSSP by using low-quality PSSMs that are down-sampled from high-quality PSSMs. Beyond the sophisticated architecture design, a novel loss function is elaborated with knowledge distillation, contrastive loss and mean squared error loss to jointly optimize the EnhanceNet and the student network for the generation of high-quality PSSMs and accurate PSSP (a sketch of such a joint objective follows this list).
  • The authors explicitly utilize the BERT pseudo PSSM to enhance extremely low-quality cases, i.e., protein sequences with no homology at all.
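The three loss terms named above could be assembled as follows. This is a hedged PyTorch sketch, not the paper's implementation: the function name `joint_loss`, the temperature, the margin, and the equal weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(student_logits, teacher_logits,     # per-residue SS logits
               anchor, positive, negative,         # embeddings for triplet term
               enhanced_pssm, high_quality_pssm,   # X_e and X_h
               T=2.0, margin=1.0, w_kd=1.0, w_tri=1.0, w_mse=1.0):
    """Sketch of a distillation + contrastive + MSE objective in the
    spirit of PSSM-Distil; all hyperparameters are illustrative."""
    # Knowledge distillation: student matches the teacher's softened
    # class distribution (Hinton, Vinyals, and Dean 2015).
    l_kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Contrastive triplet term L_t: pull the matching embedding pair
    # together, push the mismatched pair apart.
    l_tri = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # MSE term L_m between the enhanced PSSM X_e and the high-quality X_h.
    l_mse = F.mse_loss(enhanced_pssm, high_quality_pssm)
    return w_kd * l_kd + w_tri * l_tri + w_mse * l_mse
```

Scaling the distillation term by $T^2$ keeps its gradient magnitude comparable across temperatures, a standard choice from Hinton, Vinyals, and Dean (2015).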
Tables
  • Table 1: Details of the datasets used in our experiments, including dataset names, types and number of protein sequences
  • Table 2: PSSP results on the BC40, CullPDB and CB513 test sets for protein sequences with low-quality PSSMs, grouped by MSA count score. “MSA Counts” is the number of alignment sequences in the MSA of a protein sequence. The “Number” column gives the number of protein sequences in the dataset whose searched MSAs fall into that MSA Counts category. The “Real” column is the baseline result without any enhancement technique. The “Bagging” column is the result of a previous data enhancement method. Our experimental results show a large improvement over both the baseline and “Bagging”
  • Table 3: PSSP results on BC40 and CullPDB for protein sequences with low-quality PSSMs, grouped by Meff score
  • Table 4: Ablation study results of our method on the BC40 dataset. The “Our” column is the result of the full method. “w/o BERT” is the result without the BERT pseudo PSSM Xb. “w/o CL” is the result without the triplet loss Lt from contrastive learning. “w/o MSE” is the result without the MSE loss Lm between Xe and Xh. The clear degradation caused by ablating each component shows the important role these components play in our method
Related Work
  • Multiple Sequence Alignment (MSA). An MSA is a sequence alignment of multiple homologous protein sequences for a target protein (Wang and Jiang 1994). It is a key technique for modeling sequence relationships in computational biology. Given a protein database and a protein sequence, an MSA is built by performing pairwise comparisons (Altschul et al. 1990), Hidden Markov Model-like probabilistic models (Eddy 1998; Johnson, Eddy, and Portugaly 2010; Remmert et al. 2012), or a combination of both (Altschul et al. 1997) to align the sequence against the given database. Once the MSA is obtained, it is usually converted to a Position-Specific Scoring Matrix (PSSM) for subsequent tasks.

    Low-quality PSSM Enhancement. Since MSAs and PSSMs are critical for protein property prediction, “Bagging” (Guo et al. 2020) was the first attempt to enhance low-quality PSSMs. By minimizing the MSE loss between the reconstructed and original PSSMs, “Bagging” reconstructs a high-quality MSA from a down-sampled MSA with a low-quality PSSM via an unsupervised method. Even though “Bagging” was the first work to achieve relatively satisfactory performance, it still has some limitations. First, it exploits a fixed ratio for MSA down-sampling to obtain the low-quality PSSM, which makes the “Bagging” model less robust, especially for sequences with extremely low homology. Second, “Bagging” only conducts PSSM enhancement while ignoring the joint optimization of PSSM enhancement and the final PSSP.
Funding
  • The work was supported in part by the Key Area R&D Program of Guangdong Province under grant No. 2018B030338001, by the National Key R&D Program of China under grant No. 2018YFB1800800, by NSFC-Youth 61902335, by Guangdong Regional Joint Fund-Key Projects 2019B1515120039, by the Shenzhen Outstanding Talents Training Fund, by Guangdong Research Project No. 2017ZT07X152 and by the CCF-Tencent Open Fund
Study Subjects and Analysis
Public datasets: 3
Benefiting from the low-quality PSSMs Xl obtained through domain-aligned down-sampling, our EnhanceNet can output more realistic high-quality Xe, leading to superior and more robust performance than previous methods. We evaluate our PSSM-Distil framework on low-quality PSSM protein sequences from three public datasets: CullPDB, CB513 and BC40. The comparison with previous state-of-the-art models confirms the superiority of our approach.

References
  • Alley, E. C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; and Church, G. M. 2019. Unified rational protein engineering with sequence-based deep representation learning. Nature methods 16(12): 1315–1322.
  • Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; and Lipman, D. J. 1990. Basic local alignment search tool. Journal of molecular biology 215(3): 403–410.
  • Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25(17): 3389–3402.
  • Bepler, T.; and Berger, B. 2019. Learning protein sequence embeddings using information from structure. arXiv preprint arXiv:1902.08661.
  • Bucilua, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 535–541.
  • Chen, G.; Choi, W.; Yu, X.; Han, T.; and Chandraker, M. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, 742–751.
  • Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Eddy, S. R. 1998. Profile hidden Markov models. Bioinformatics (Oxford, England) 14(9): 755–763.
  • Guo, Y.; Wu, J.; Ma, H.; Wang, S.; and Huang, J. 2020. Bagging MSA Learning: Enhancing Low-Quality PSSM with Deep Learning for Accurate Protein Structure Property Prediction. In International Conference on Research in Computational Molecular Biology, 88–103. Springer.
  • Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, 1735–1742. IEEE.
  • He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
  • Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; and Rost, B. 2019. Modeling the Language of Life-Deep Learning Protein Sequences. bioRxiv 614313.
  • Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Johnson, L. S.; Eddy, S. R.; and Portugaly, E. 2010. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC bioinformatics 11(1): 431.
  • Kabsch, W.; and Sander, C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules 22(12): 2577–2637.
  • Kryshtafovych, A.; Barbato, A.; Fidelis, K.; Monastyrskyy, B.; Schwede, T.; and Tramontano, A. 2014. Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins: Structure, Function, and Bioinformatics 82: 112–126.
  • Li, Z.; and Yu, Y. 2016. Protein secondary structure prediction using cascaded convolutional and recurrent neural networks. arXiv preprint arXiv:1604.07176.
  • Mandell, D. J.; and Kortemme, T. 2009. Computer-aided design of functional protein interactions. Nature chemical biology 5(11): 797–807.
  • Mirzadeh, S.-I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; and Ghasemzadeh, H. 2019. Improved Knowledge Distillation via Teacher Assistant. arXiv preprint arXiv:1902.03393.
  • Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6707–6717.
  • Noble, M. E.; Endicott, J. A.; and Johnson, L. N. 2004. Protein kinase inhibitors: insights into drug design from structure. Science 303(5665): 1800–1805.
  • Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
  • Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.
  • Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; and Song, Y. 2019. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, 9686–9698.
  • Remmert, M.; Biegert, A.; Hauser, A.; and Soding, J. 2012. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods 9(2): 173.
  • Rives, A.; Goyal, S.; Meier, J.; Guo, D.; Ott, M.; Zitnick, C. L.; Ma, J.; and Fergus, R. 2019. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 622803.
  • Schmitt, S.; Hudson, J. J.; Zidek, A.; Osindero, S.; Doersch, C.; Czarnecki, W. M.; Leibo, J. Z.; Kuttler, H.; Zisserman, A.; Simonyan, K.; et al. 2018. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835.
  • Sønderby, S. K.; and Winther, O. 2014. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828.
  • Steinegger, M.; and Soding, J. 2017. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology 35(11): 1026–1028.
  • Suzek, B. E.; Wang, Y.; Huang, H.; McGarvey, P. B.; Wu, C. H.; and Consortium, U. 2015. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31(6): 926–932.
  • Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
  • Wang, G.; and Dunbrack Jr, R. L. 2003. PISCES: a protein sequence culling server. Bioinformatics 19(12): 1589–1591.
  • Wang, L.; and Jiang, T. 1994. On the complexity of multiple sequence alignment. Journal of computational biology 1(4): 337–348.
  • Wang, R. Y.-R.; Kudryashev, M.; Li, X.; Egelman, E. H.; Basler, M.; Cheng, Y.; Baker, D.; and DiMaio, F. 2015. De novo protein structure determination from near-atomic-resolution cryo-EM maps. Nature methods 12(4): 335–338.
  • Wang, S.; Peng, J.; Ma, J.; and Xu, J. 2016. Protein secondary structure prediction using deep convolutional neural fields. Scientific reports 6(1): 1–11.
  • Wuthrich, K. 1989. Protein structure determination in solution by nuclear magnetic resonance spectroscopy. Science 243(4887): 45–50.
  • Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, 5754–5764.
  • Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4133–4141.
  • Yu, R.; Li, A.; Morariu, V. I.; and Davis, L. S. 2017. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE international conference on computer vision, 1974–1982.
  • Zhou, J.; and Troyanskaya, O. G. 2014. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. arXiv preprint arXiv:1403.1347.
  • Zhuang, C.; Zhai, A. L.; and Yamins, D. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, 6002–6012.