How Well do Offline Metrics Predict Online Performance of Product Ranking Models?

Xiaojie Wang, Ruoyuan Gao, Anoop Jain, Graham Edge, Sachin Ahuja

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023 (2023)

Abstract
Online evaluation techniques are widely adopted by industrial search engines to determine which ranking models perform better under a given business metric. However, online evaluation can only cover a small number of rankers, so practitioners resort to offline evaluation to select rankers that are likely to yield good online performance. To use offline metrics for effective model selection, a major challenge is to understand how well offline metrics predict which ranking models perform better in online experiments. This paper aims to address this challenge in product search ranking. To this end, we collect gold data in the form of preferences over ranker pairs under a business metric in an e-commerce search engine. For the first time, we use such gold data to evaluate offline metrics in terms of directional agreement with the business metric. Furthermore, we analyze offline metrics in terms of discriminative power through paired-sample t-tests and rank correlations among offline metrics. Through extensive online and offline experiments, we studied 36 offline metrics and observed that: (1) offline metrics align well with online metrics: they agree on which of two ranking models is better up to 97% of the time; (2) offline metrics are highly discriminative on large-scale search ranking data, especially NDCG (Normalized Discounted Cumulative Gain), which has a discriminative power of over 99%.
Keywords
Evaluation metrics, online evaluation, offline evaluation
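
The quantities named in the abstract (NDCG, directional agreement between an offline metric and online preferences, and the paired-sample t-test) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the linear-gain form of DCG, and the encoding of online preferences as +1 / -1 per ranker pair are all assumptions made here for illustration.

```python
import math
import statistics


def dcg(relevances, k):
    """Discounted cumulative gain at cutoff k (linear gain; an assumed variant)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """NDCG at cutoff k: DCG divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def directional_agreement(offline_deltas, online_prefs):
    """Fraction of ranker pairs where the offline metric and the online
    preference point the same way.

    offline_deltas[i]: offline metric of ranker A minus ranker B for pair i.
    online_prefs[i]:   +1 if A won online, -1 if B won (hypothetical encoding).
    """
    agree = sum(
        1
        for delta, pref in zip(offline_deltas, online_prefs)
        if (delta > 0 and pref > 0) or (delta < 0 and pref < 0)
    )
    return agree / len(online_prefs)


def paired_t_statistic(scores_a, scores_b):
    """Standard paired-sample t statistic over per-query metric scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err
```

For example, `directional_agreement([0.1, -0.2, 0.3, 0.05], [1, -1, -1, 1])` counts three agreeing pairs out of four. Turning the t statistic into a significance decision requires a p-value from the t distribution with n - 1 degrees of freedom, which this sketch leaves out.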