Automated Detection of Health Websites' HONcode Conformity: Can N-gram Tokenization Replace Stemming?
Studies in Health Technology and Informatics (2015)
Abstract
The authors evaluated supervised automatic classification algorithms for determining whether health-related web pages comply with individual HONcode criteria of conduct, using character n-gram vectors of varying length to represent healthcare web page documents. The training/testing collection comprised web page fragments extracted by HONcode experts during the manual certification process. The authors compared the automated classification performance of n-gram tokenization with that of document words and Porter-stemmed document words, using a Naive Bayes classifier and DF (document frequency) as the dimensionality reduction metric. The study attempted to determine whether the automated, language-independent n-gram approach might safely replace word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function with Chi-square and Z-score dimensionality reduction. Overall, the study results indicate that n-gram tokenization is a potentially viable alternative to document word stemming.
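The pipeline described in the abstract (character n-gram tokenization, DF-based feature reduction, Naive Bayes classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `top_k` cutoff, the add-one smoothing, the text normalization, and the toy documents and labels are all assumptions made for illustration, and the Chi-square and Z-score reductions are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def select_by_df(docs, n=5, top_k=1000):
    """Keep the top_k n-grams ranked by document frequency (assumed cutoff)."""
    df = Counter()
    for text in docs:
        df.update(set(char_ngrams(text, n)))  # count each n-gram once per document
    # Descending DF; ties broken alphabetically so the result is deterministic
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram for gram, _ in ranked[:top_k]}


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over a fixed n-gram vocabulary."""

    def __init__(self, vocabulary, n=5):
        self.vocabulary = vocabulary
        self.n = n

    def _features(self, text):
        return [g for g in char_ngrams(text, self.n) if g in self.vocabulary]

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.gram_counts[label].update(self._features(text))
        self.totals = {c: sum(self.gram_counts[c].values()) for c in self.class_counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for c in sorted(self.class_counts):
            score = math.log(self.class_counts[c] / total_docs)  # log prior
            for g in self._features(text):
                score += math.log((self.gram_counts[c][g] + 1) / (self.totals[c] + vocab_size))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical toy fragments; the labels loosely echo HONcode criteria names.
toy_docs = [
    "we protect your privacy and personal data",
    "privacy policy: personal data is confidential",
    "this site is funded by advertising revenue",
    "funding comes from advertising and sponsors",
]
toy_labels = ["privacy", "privacy", "funding", "funding"]

vocabulary = select_by_df(toy_docs, n=5, top_k=1000)
model = NaiveBayes(vocabulary, n=5).fit(toy_docs, toy_labels)
prediction = model.predict("your personal data stays private")
```

Because the features are raw character windows, no language-specific stemmer or word tokenizer is involved, which is the property the study tests against Porter stemming.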
Keywords
Machine learning, N-gram, HONcode