Automated Detection of Health Websites' HONcode Conformity: Can N-gram Tokenization Replace Stemming?

Studies in Health Technology and Informatics (2015)

Abstract
The authors evaluated supervised automatic classification algorithms for determining whether health-related web pages comply with individual HONcode criteria of conduct, using character n-gram vectors of varying length to represent healthcare web page documents. The training/testing collection comprised web page fragments extracted by HONcode experts during the manual certification process. The authors compared the automated classification performance of n-gram tokenization with that of document words and Porter-stemmed document words, using a Naive Bayes classifier and DF (document frequency) dimensionality reduction. The study sought to determine whether the automated, language-independent approach might safely replace word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function with Chi-square and Z-score dimensionality reductions. Overall, the study results indicate that n-gram tokenization provides a potentially viable alternative to document word stemming.
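
The pipeline described in the abstract (character n-gram tokenization, DF-based feature reduction, Naive Bayes classification) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace and case normalization, the Laplace smoothing, the multinomial model variant, and the toy fragments and labels are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Overlapping character n-grams of a lowercased, whitespace-normalized text.
    The normalization step is an assumption, not taken from the paper."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def select_by_df(docs_tokens, top_k):
    """Keep the top_k features with the highest document frequency (DF)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each feature once per document
    # Sort by descending DF; break ties alphabetically for determinism.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:top_k])


def train_nb(docs_tokens, labels, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing over vocab."""
    class_docs = Counter(labels)
    counts = {c: Counter() for c in class_docs}
    for tokens, label in zip(docs_tokens, labels):
        counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_c in class_docs.items():
        denom = sum(counts[c].values()) + len(vocab)
        model[c] = {
            "prior": math.log(n_c / len(labels)),
            "loglik": {t: math.log((counts[c][t] + 1) / denom) for t in vocab},
        }
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log posterior; features outside
    the selected vocabulary are ignored."""
    def score(c):
        loglik = model[c]["loglik"]
        return model[c]["prior"] + sum(loglik[t] for t in tokens if t in loglik)
    return max(model, key=score)


if __name__ == "__main__":
    # Hypothetical fragments and labels, invented for illustration only.
    fragments = [
        "This site is funded by donations",
        "Funding comes from advertising revenue",
        "Your privacy is protected",
        "We keep personal data confidential",
    ]
    labels = ["funding", "funding", "privacy", "privacy"]
    tokenized = [char_ngrams(f, n=5) for f in fragments]
    vocab = select_by_df(tokenized, top_k=200)
    model = train_nb(tokenized, labels, vocab)
    print(predict_nb(model, char_ngrams("funded by advertising", n=5)))
```

Because the tokenizer works on raw characters, no language-specific stemmer or word segmenter is needed, which is the property the study weighs against Porter stemming. Swapping `select_by_df` for a Chi-square or Z-score ranking would change only the feature-selection step.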
Keywords
Machine learning, N-gram, HONcode