AlzGenPred: A CatBoost based method using network features to classify the Alzheimer’s Disease associated genes from the high throughput sequencing data

bioRxiv (Cold Spring Harbor Laboratory)(2023)

Abstract
Background and Objective: Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by memory loss. Advances in next-generation sequencing technologies have made an enormous amount of AD-associated genomics data available. However, how these genes are involved in AD remains an open research question, because the algorithms used to analyze these data are all based on statistical techniques. AlzGenPred was therefore developed to identify AD-associated genes from large datasets.
Methods: To develop AlzGenPred, we compiled a benchmark dataset of 1086 AD and non-AD genes and used them as the positive and negative datasets. We generated several feature sets, including fused features, and evaluated them with machine learning (ML) methods. Hyperparameter tuning was then applied and the final model was selected. The proposed method was validated on the AlzGene and transcriptomics datasets and is provided as a standalone tool.
Results: A total of 13504 features from eight different encoding schemes were generated from these sequences and evaluated using 16 ML algorithms. The evaluation revealed that network-based features can classify AD genes, whereas sequence-based features cannot. We then generated 24 different fused features (6020 D) from the sequence-based features and fed them into a two-step LightGBM-based recursive feature selection method, which increased accuracy by up to 5-7%. The eight selected fused features, together with CKSAAP, were then used for hyperparameter tuning, but they achieved <70% accuracy. Network-based features were therefore used to build the CatBoost-based ML method, called AlzGenPred, which reached 96.55% accuracy and 98.99% AUROC. When tested on the AlzGene dataset, the method showed 96.43% accuracy. The model was then also validated using the transcriptomics dataset.
Conclusion: Validation of AlzGenPred on the AlzGene dataset and on transcriptomics datasets obtained from human, mouse, and ES-derived neural cells showed that it can classify omics data and sort out the AD-associated genes. The predicted genes can be taken directly into the wet lab for further testing, which will reduce labor costs and time. AlzGenPred is developed as a standalone package and is available for users at and .
Competing Interest Statement: The authors have declared no competing interest.
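The Results mention CKSAAP (composition of k-spaced amino acid pairs) as one of the sequence-based encodings. As a rough illustration of what such an encoding computes, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `cksaap`, the `k_max` parameter and its default, and the normalization by the number of valid pairs are all assumptions made for this example, since the abstract does not state the settings used.

```python
from itertools import product

# The 20 standard amino acids; any other symbol in a sequence is skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence, k_max=2):
    """Return CKSAAP features as a dict keyed by (gap, residue_pair).

    For each gap k in 0..k_max, count every pair of residues separated by
    exactly k positions and divide by the number of valid pairs for that gap.
    The value of k_max is an illustrative assumption, not taken from the paper.
    """
    seq = sequence.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = {}
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        # Position i is paired with position i + k + 1.
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        for pair in pairs:
            features[(k, pair)] = counts[pair] / total if total else 0.0
    return features


if __name__ == "__main__":
    feats = cksaap("ACAC", k_max=1)
    print(len(feats))          # 400 pair frequencies per gap value
    print(feats[(0, "AC")])    # frequency of adjacent "AC" pairs
```

Each gap value contributes 400 features (20 x 20 residue pairs), so the vector length grows linearly with `k_max`. According to the abstract, sequence-based encodings of this kind stayed below 70% accuracy, which is why the final model relies on network-based features instead.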
Keywords
Alzheimer's disease, CatBoost