谷歌浏览器插件
订阅小程序
在清言上使用

Automatic tagging of persian web pages based on n-gram language models using mapreduce

soft computing(2015)

引用 0|浏览15
暂无评分
摘要
Page tagging is one of the most important facilities for increasing the accuracy of information retrieval in the web. Tags are simple pieces of data that usually consist of one or several words, and briefly describe a page. Tags provide useful information about a page and can be used for boosting the accuracy of searching, document clustering, and result grouping. The most accurate solution to page tagging is using human experts. However, when the number of pages is large, humans cannot be used, and some automatic solutions should be used instead. We propose a solution called PerTag which can automatically tag a set of Persian web pages. PerTag is based on n-gram models and uses the tf-idf method plus some effective Persian language rules to select proper tags for each web page. Since our target is huge sets of web pages, PerTag is built on top of the MapReduce distributed computing framework. We used a set of more than 500 million Persian web pages during our experiments, and extracted tags for each page using a cluster of 40 machines. The experimental results show that PerTag is both fast and accurate
更多
查看译文
关键词
Persian Web,Automatic Web Page Tagging,MapReduce
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要