Text Categorization Using n-Gram Based Language Independent Technique

semanticscholar(2014)

Abstract
This paper presents a language- and topic-independent, byte-level n-gram technique for topic-based text categorization. The technique relies on n-gram frequency statistics for document representation and on a variant of the k-nearest-neighbors machine learning algorithm for categorization. It requires no morphological analysis of texts, no preprocessing steps, and no prior information about document content or language. Five document collections are used in the experiments: Ebart-3 in Serbian, Reuters-21578 and 20-Newsgroups in English, Tancorp-12 in Chinese, and Maslah-10 in Arabic. Micro- and macro-averaged F1 measures are used for evaluation. Comparisons between the results obtained by the presented technique and those obtained by other n-gram-based and traditional "bag of words" text categorization techniques demonstrate that the technique is sound and promising.
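The general idea of a byte-level n-gram profile combined with nearest-neighbor classification can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the abstract does not state the n-gram length, the profile size, the dissimilarity measure, or which k-nearest-neighbors variant is used, so the values `n=3`, `top=100`, the squared relative-difference measure, and the plain majority vote below are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def byte_ngram_profile(text, n=3, top=100):
    """Return the `top` most frequent byte n-grams with normalized frequencies.

    Working on UTF-8 bytes means no tokenization or language-specific
    preprocessing is needed. n and top are assumed values.
    """
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top)}


def dissimilarity(p, q):
    """Sum of squared relative frequency differences (an assumed measure)."""
    score = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because gram occurs in at least one profile.
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def classify(text, training, k=1, n=3, top=100):
    """Label `text` by majority vote among its k nearest training profiles.

    `training` is a list of (profile, label) pairs.
    """
    profile = byte_ngram_profile(text, n, top)
    ranked = sorted(training, key=lambda item: dissimilarity(profile, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

To use the sketch, build one profile per labeled training document with `byte_ngram_profile`, collect them as `(profile, label)` pairs, and pass that list to `classify` together with the text to be categorized. Because the profiles are built from raw bytes, the same code runs unchanged on Serbian, English, Chinese, or Arabic text.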