A Thread-wise Strategy for Incremental Crawling of Web Forums

Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma

google(2008)

Abstract
This paper studies the problem of incremental crawling of web forums, a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights to individual pages is usually inefficient for crawling forum sites, because forum sites have different characteristics from general websites. Instead of treating each page independently, we propose a thread-wise strategy that takes into account thread-level statistics, for example the number of replies and the frequency of replies, to estimate the activity trend of each thread. To obtain these statistics, we develop a simple yet very robust approach to extracting the timestamp of each post in a discussion thread. We also employ a regression model to predict the time of the next post for each thread. Based on this model, we develop a highly efficient crawler that is 2.6 times faster than state-of-the-art methods at fetching newly generated content while still ensuring a high coverage ratio. Experimental results show encouraging performance in terms of Coverage, Bandwidth utilization, and Age for our approach on various forums.
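The abstract does not specify the regression model or its features, so the following is only a minimal sketch of the general idea: use the post timestamps of a thread to predict when the next post will arrive, then revisit threads in order of that prediction. The function names `predict_next_post_time` and `revisit_order`, the choice of a least-squares line over inter-post gaps, and the sample data are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List


def predict_next_post_time(timestamps: List[float]) -> float:
    """Predict the time of the next post in one thread.

    Assumption: a straight line fitted by least squares to the gaps
    between consecutive posts stands in for the paper's regression model.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two post timestamps")

    # Gaps between consecutive posts capture the reply frequency.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    if n == 1:
        # With a single gap there is no trend to fit; repeat the gap.
        return timestamps[-1] + gaps[0]

    # Ordinary least squares for gap = slope * index + intercept.
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(gaps) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gaps))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Extrapolate one step ahead; a gap cannot be negative.
    next_gap = max(slope * n + intercept, 0.0)
    return timestamps[-1] + next_gap


def revisit_order(threads: Dict[str, List[float]]) -> List[str]:
    """Order thread ids so the thread expected to update soonest comes first."""
    return sorted(threads, key=lambda tid: predict_next_post_time(threads[tid]))


# Hypothetical data: post times in minutes since the thread was created.
threads = {
    "cooling": [0, 10, 30, 60],  # gaps grow, so activity is slowing down
    "hot": [0, 5, 10, 15],       # steady replies every five minutes
}
print(revisit_order(threads))
```

In this sketch a thread whose gaps keep widening is pushed later in the revisit queue, which is one simple way to act on an estimated activity trend at the thread level rather than weighting each page separately.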