Empirical analysis of chronic disease dataset for multiclass classification using optimal feature selection based hybrid model with spark streaming

Future Gener. Comput. Syst. (2023)

Abstract
Recent advances in pervasive healthcare monitoring systems generate huge amounts of lifelog data in real time. Chronic diseases are among the most serious health challenges in both developing and developed countries. According to WHO, they account for 73% of all deaths and 60% of the global burden of disease. Chronic disease classification models are now harnessing lifelog data to explore better healthcare practices. This paper constructs an optimal feature selection-based hybrid model that integrates k-means clustering to handle unlabeled data, the Synthetic Minority Oversampling Technique (SMOTE) to balance classes, PCA to select optimal features, and a Random Forest (RF) classifier to classify chronic diseases. Lifelog data analysis is a delicate task because of the sensitive nature of the data, and conventional classification models show limited performance on it. New classifiers for chronic disease classification using lifelog data are therefore needed. Building a good model depends on pre-processing the dataset, identifying the important features, and then training a learning algorithm with suitable hyperparameters. The proposed approach improves on the performance of existing methods through a series of steps: (i) removing redundant or invalid instances; (ii) labeling the data by clustering and partitioning it into classes; (iii) balancing the classes to avoid results biased toward the majority class; (iv) identifying a suitable subset of features by applying either domain knowledge or a selection algorithm; (v) tuning model hyperparameters to get the best results; (vi) developing a new hybrid model, the optimal feature selection based unsupervised random forest classifier (OFS-URFC), which gives the best results in both binary and multiclass classification; and (vii) evaluating the performance of this newly designed classifier in a Spark streaming environment that processes the data in real time. Two time-series datasets are used in the experiments to compute accuracy, recall, precision, F1-score, and related metrics. The experimental analysis shows that the proposed approach is better suited to the task than conventional classifiers: the newly constructed model achieved the highest accuracy and reduced training complexity among the models compared.
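The abstract gives no implementation details, so the following is only a minimal sketch of how steps (ii) to (iv) and the Random Forest stage might be chained, assuming scikit-learn and NumPy and using synthetic data. The function names `smote_like_oversample`, `run_pipeline`, and `make_demo_data`, along with every parameter value, are hypothetical and not taken from the paper. The oversampler is a simplified interpolation stand-in for SMOTE, and hyperparameter tuning and the Spark streaming evaluation are left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, k=5, seed=0):
    """Balance classes by interpolating between a minority sample and one of
    its nearest same-class neighbours (simplified stand-in for SMOTE)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.max())
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        count = int(count)
        n_new = target - count
        if n_new == 0 or count < 2:
            continue  # already the majority class, or too small to interpolate
        X_cls = X[y == cls]
        n_neighbors = min(k, count - 1)
        nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_cls)
        _, idx = nn.kneighbors(X_cls)  # column 0 is normally the point itself
        base = rng.integers(0, count, size=n_new)
        neigh = idx[base, rng.integers(1, n_neighbors + 1, size=n_new)]
        gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
        X_parts.append(X_cls[base] + gap * (X_cls[neigh] - X_cls[base]))
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


def run_pipeline(X_raw, n_classes=3, seed=0):
    """k-means pseudo-labels -> oversampling -> PCA -> Random Forest."""
    X = StandardScaler().fit_transform(X_raw)
    # (ii) derive class labels for unlabeled data with k-means
    y = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(X)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # (iii) balance only the training split so the test split stays untouched
    X_tr, y_tr = smote_like_oversample(X_tr, y_tr, seed=seed)
    # (iv) keep the principal components explaining 95% of the variance
    pca = PCA(n_components=0.95, svd_solver="full").fit(X_tr)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(pca.transform(X_tr), y_tr)
    # Accuracy here is agreement with the k-means pseudo-labels, not ground truth
    accuracy = accuracy_score(y_te, clf.predict(pca.transform(X_te)))
    return accuracy, y_tr


def make_demo_data(seed=0):
    """Three imbalanced Gaussian blobs standing in for lifelog features."""
    rng = np.random.default_rng(seed)
    sizes, centers = (300, 80, 40), (0.0, 6.0, -6.0)
    return np.vstack(
        [rng.normal(c, 1.0, size=(n, 6)) for n, c in zip(sizes, centers)]
    )


accuracy, y_balanced = run_pipeline(make_demo_data())
print(f"held-out accuracy: {accuracy:.3f}")
print("training class counts after balancing:", np.bincount(y_balanced))
```

Oversampling is applied after the train/test split in this sketch so that synthetic points never leak into the held-out data; whether the paper orders the steps this way is not stated in the abstract.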
Keywords
Chronic diseases, Lifelog data, Optimal feature selection, Hyper parameter, Machine learning, Classification, Clustering