Visual Item Selection With Voice Assistants

COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023 (2023)

Abstract
Interacting with voice assistants, such as Amazon Alexa, to aid in day-to-day tasks has become a ubiquitous phenomenon in modern-day households. These voice assistants often have screens to provide visual content (e.g., images, videos) to their users. There is an increasing trend of users shopping or searching for products using these devices, yet these voice assistants do not support commands or queries that contain visual references to the content shown on screen (e.g., "blue one", "red dress"). We introduce a novel multi-modal visual shopping experience where the voice assistant is aware of the visual content shown on the screen and assists the user in item selection using natural language multi-modal interactions. We detail a practical, lightweight end-to-end system architecture spanning model fine-tuning, deployment, and skill invocation on an Amazon Echo family device with a screen. We also define a niche "Visual Item Selection" task and evaluate whether we can effectively leverage publicly available multi-modal models, and the embeddings produced by these models, for the task. We show that open-source contrastive embeddings such as CLIP [30] and ALBEF [24] achieve zero-shot accuracy above 70% on the "Visual Item Selection" task over an internally collected visual shopping dataset. By further fine-tuning the embeddings, we obtain relative accuracy improvements of 8.6% to 24.0% over a baseline. The technology that enables our visual shopping assistant is available as an Alexa Skill in the Alexa Skills store.
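As a rough illustration of the zero-shot setup described in the abstract, the sketch below scores a set of candidate on-screen item images against a spoken visual reference using an open-source CLIP checkpoint via Hugging Face Transformers. The checkpoint name, image file paths, and the simple argmax-over-similarity selection are illustrative assumptions, not details taken from the paper; the authors' fine-tuned embeddings and on-device pipeline are not reproduced here.

```python
# Minimal zero-shot "Visual Item Selection" sketch with open-source CLIP.
# Assumptions (not from the paper): checkpoint name, local image files,
# and argmax-over-similarity selection.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Candidate items currently shown on the device screen (hypothetical files).
candidate_images = [Image.open(p) for p in ["item_0.jpg", "item_1.jpg", "item_2.jpg"]]

# The user's visual reference, e.g. as transcribed from speech.
query = "the blue dress"

inputs = processor(text=[query], images=candidate_images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images); higher means more similar.
scores = outputs.logits_per_text.softmax(dim=-1)
selected = scores.argmax(dim=-1).item()
print(f"Selected on-screen item index: {selected}")
```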
Keywords
Multi-modality, voice assistants, systems architecture, visual shopping, deployed system, app based skills