To be or not to be IID: Can Zipf's Law help?

Leo Behe, Zachary Wheeler, Christie Nelson, Brian Knopp, William M Pottenger

2015 IEEE International Symposium on Technologies for Homeland Security (HST) (2015)

Abstract
Classification is a central problem in machine learning, and improving the effectiveness of classification algorithms has significant applications in industry and academia. This paper focuses on Higher-Order Naive Bayes (HONB), a relational variant of the Naive Bayes (NB) statistical classification algorithm that has been shown to outperform NB in many cases [1,10]. Specifically, HONB has outperformed NB on character n-gram based feature spaces when little training data is available [2]. We hypothesize a correlation between the performance of HONB on character n-gram feature spaces and how closely the feature space distribution follows Zipf's Law. This hypothesis stems from the overarching goal of understanding HONB and knowing when it will outperform NB. Textual datasets ranging from several thousand to nearly 20,000 instances, some containing microtext, were used to generate character n-gram feature spaces. Both HONB and NB were used to model these datasets, with character n-gram sizes from 2 to 7 and dictionary sizes of up to 5000 features. Comparing the performance of HONB and NB shows potential support for our hypothesis: the results support the hypothesized correlation for the Accuracy and Precision metrics. Additionally, a solution is provided to an open problem presented in [1]: an explicit formula for the number of systems of distinct representatives (SDRs) from k given sets. This formula has connections to counting higher-order paths of arbitrary length, which are important in the learning stage of HONB.
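As a rough illustration of the setup described in the abstract, the following sketch builds a character n-gram dictionary capped at a maximum number of features and estimates how closely its rank-frequency distribution follows Zipf's Law. The abstract does not say how closeness to Zipf's Law was measured, so the least-squares fit on the log-log rank-frequency curve is an assumption, and the function names `char_ngram_counts` and `zipf_fit` are hypothetical.

```python
from collections import Counter
import math


def char_ngram_counts(texts, n, max_features=5000):
    """Count character n-grams over all texts and keep the most frequent ones.

    Returns a list of (ngram, count) pairs sorted by descending count.
    """
    counts = Counter()
    for text in texts:
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts.most_common(max_features)


def zipf_fit(frequencies):
    """Fit log(frequency) = intercept + slope * log(rank) by least squares.

    Assumption: a slope near -1 with R^2 near 1 is taken as "close to Zipf".
    Returns (slope, r_squared).
    """
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two frequencies to fit a line")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # If all frequencies are equal, the flat line fits perfectly.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared


if __name__ == "__main__":
    # Toy corpus for illustration only; not data from the paper.
    corpus = ["to be or not to be", "that is the question"]
    top = char_ngram_counts(corpus, n=2, max_features=5000)
    slope, r2 = zipf_fit([count for _, count in top])
    print(f"bigram types: {len(top)}, slope: {slope:.3f}, R^2: {r2:.3f}")
```

Under this assumed measure, one could repeat the fit for each n-gram size from 2 to 7 and compare the resulting slope and R^2 values against the observed gap between HONB and NB on the same feature space.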
Keywords
Zipf's law, machine learning, classification algorithms, higher-order naive Bayes, HONB, naive Bayes statistical classification algorithm, character n-gram based feature spaces, character n-gram feature spaces, feature space distribution, textual datasets, accuracy metrics, precision metrics, IID, independent and identically distributed