HSPXY: A hybrid-correlation and diversity-distances based data partition method

Journal of Chemometrics (2019)

Abstract
A representative dataset is crucial for building a robust, generalizable machine learning model, especially when the database is small. Distance-based set partition methods do not usually consider correlation, so distant yet correlated samples may be assigned incorrectly. An improved sample subset partition method based on joint hybrid correlation and diversity x-y distances (HSPXY) is proposed within the framework of sample set partitioning based on joint x-y distances (SPXY). In it, a hybrid distance that combines the cosine angle distance and the Euclidean distance in the variable spaces incorporates the correlation of samples into the distance-based set partition. For comparison with existing partition methods, partial least squares (PLS) regression models are built on four set partition methods: random sampling (RS), Kennard-Stone (KS), SPXY, and HSPXY. When applied to small chemical databases, models based on the proposed HSPXY algorithm achieved smaller root mean square errors and better coefficients of determination than models based on the other tested set partition methods, which indicates that the training set is well represented. This suggests that the proposed algorithm provides a new option for obtaining a representative calibration set.

Sample subset partition is widely considered in machine learning modeling. An improved sample subset partition method based on a hybrid correlation and diversity x-y distance (HSPXY) is proposed in the framework of SPXY. The cosine angle distance and the Euclidean distance in the variable spaces represent the correlation and the diversity of samples, respectively. To explore the effectiveness of HSPXY, PLS models are built on four set partition methods: RS, KS, SPXY, and HSPXY. The models based on the proposed HSPXY algorithm gave the best overall results among all regression models, which suggests that the proposed algorithm may serve as an alternative to existing data partition methods.
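The idea described in the abstract, a Kennard-Stone style max-min selection driven by a distance that mixes Euclidean and cosine angle terms, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the exact formula, so the way the terms are combined here (each distance matrix scaled by its maximum, then summed with equal weight, with the cosine term applied to X only) is an assumption, and the function names `hybrid_xy_distance` and `max_min_select` are hypothetical.

```python
import numpy as np


def pairwise_euclidean(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pairwise_cosine_distance(A):
    """1 - cos(angle) between every pair of rows of A."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit = A / norms
    return 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)


def _scale(D):
    """Divide a distance matrix by its maximum so terms are comparable."""
    m = D.max()
    return D / m if m > 0 else D


def hybrid_xy_distance(X, y):
    """ASSUMED combination: equal-weight sum of max-scaled distances.

    Euclidean terms in X and y follow the SPXY idea; the cosine term
    in X stands in for the correlation part. The weighting used in the
    paper may differ.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(y, dtype=float).reshape(len(X), -1)
    return (
        _scale(pairwise_euclidean(X))
        + _scale(pairwise_cosine_distance(X))
        + _scale(pairwise_euclidean(Y))
    )


def max_min_select(D, n_select):
    """Kennard-Stone style selection on a precomputed distance matrix.

    Starts from the two most distant samples, then repeatedly adds the
    sample whose nearest already-selected neighbour is farthest away.
    Returns (selected indices, remaining indices).
    """
    n = D.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the sample count")
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))   # synthetic predictors, for illustration
    y = rng.normal(size=20)        # synthetic response
    calib, valid = max_min_select(hybrid_xy_distance(X, y), n_select=14)
    print("calibration:", calib)
    print("validation:", valid)
```

Scaling each term by its maximum keeps any one distance from dominating the sum, which is the same motivation SPXY has for normalizing its x and y distances. Different weights for the cosine and Euclidean terms would be a natural variation to try.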
Keywords
HSPXY, Kennard-Stone (KS), partial least squares (PLS), sample set partitioning based on joint x-y distances (SPXY), set partition