Offline vs. Online Evaluation in Voice Product Search


1 BACKGROUND
Intelligent voice assistants such as Amazon Alexa, Google Assistant, and Apple Siri have recently been gaining popularity. One emerging usage of such assistants is shopping. Restocking toilet paper after using the last roll, or ordering garlic while adding the last clove to your pasta sauce, is a new habit that people are acquiring. In the traditional usage scenario, customers issue a product search query by voice and get back a list of candidate products on which they can take actions such as add-to-cart or buy. Voice product search introduces a new paradigm and drives users’ behavior to differ drastically from that in other domains with closed collections, such as Web Mail or Web Product search. As the output is spoken, customers are exposed to fewer results, with much less information. Positive shopping actions, such as buy and add-to-cart, are much more frequent on the first result than in other domains.
Before a ranking model is pushed into production, a common practice is to first evaluate it offline. Offline experiments are much easier to conduct than online experiments and therefore allow for faster iteration on algorithmic improvements. They also act as a safeguard, in the hope of catching defective models before they are tested on real users, even in limited numbers. Offline evaluation relies on a dataset with relevance judgments. Such datasets are typically derived from historical search logs, which include search queries, associated results, and actions taken by real users on these results. In our context, however, most judgments are associated with the first result presented.
In this workshop, we would like to argue that traditional search evaluation methods cannot be used “as is” in voice product search. We will show that log-based offline experiments do not correlate sufficiently with online results to be valuable. This has also been demonstrated recently by Carterette et al. [1] in other domains.
Moreover, because voice shopping is still a new habit, online experiments might be riskier than in other environments such as Web search, as negatively affected users might not try the experience again. We hope to discuss with other attendees the need to invent new types of offline experiments that would be less sensitive to display order, and new types of online experiments suited to settings where data is relatively scarce.
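As a rough illustration of what "correlating" offline and online results can mean, the sketch below computes a Kendall rank correlation between an offline metric and an online metric measured on the same set of candidate ranking models. This is not the measurement procedure of the paper: the choice of Kendall tau, the metric names (`offline_mrr`, `online_purchase_rate`), and all numbers are made-up assumptions for illustration only.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both metrics order models i and j the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four candidate rankers (illustrative values only).
offline_mrr = [0.61, 0.64, 0.66, 0.70]               # from logged judgments
online_purchase_rate = [0.052, 0.049, 0.055, 0.050]  # from live experiments

tau = kendall_tau(offline_mrr, online_purchase_rate)
print(tau)  # 0.0: the offline ordering says nothing about the online ordering
```

A value near 1 would mean that the offline metric orders models the same way as the online experiment does; a value near 0, as in this invented example, would mean the offline experiment gives no usable signal about which model to ship.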