Adversarial Bandits Policy for Crawling Commercial Web Content

Shuguang Han,Michael Bendersky,Przemek Gajda,Sergey Novikov,Marc Najork,Bernhard Brodowsky,Alexandrin Popescul

WWW '20: The Web Conference 2020 Taipei Taiwan April, 2020（2020）

引用 4|浏览87

暂无评分

摘要

The rapid growth of commercial web content has driven the development of shopping search services to help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component in a shopping search service; it ensures that users have access to the up-to-date product details. Most of the existing strategies either relied on simple heuristics, or overlooked the resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy LambdaCrawl aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future content change rate can be estimated. By adopting the state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase of content freshness over the common LambdaCrawl implementation with change rate estimated from the past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement upon existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we adopt the K-armed adversarial bandits algorithm that can provably optimize the overall freshness by combining multiple strategies. Empirical results over a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets.

查看译文

关键词

Predictive Crawling, Commercial Web Crawling, Adversarial Bandit

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要