Fashion-GPT: Integrating LLMs with Fashion Retrieval System

PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023

Abstract
Although customers on fashion e-commerce platforms express their clothing preferences through combined imagery and textual information, they are limited to retrieving with single-round, fixed inputs. At the same time, large language models (LLMs) have been gaining attention across various fields. ChatGPT is a remarkable example of an LLM, known for its user-friendly language interface, impressive conversational proficiency, and reasoning abilities. To this end, we propose Fashion-GPT, a system paradigm that integrates ChatGPT with a pool of AI models in the fashion domain to achieve multi-round, multi-modal search. Specifically, the system uses the LLM to understand user queries, selects retrieval models based on their function descriptions, executes each subtask with the selected fashion models, and leverages the LLM to summarize a response from the execution results. To boost the performance of the underlying AI experts, we also introduce a novel pre-training framework called 3M (short for Multi-view Multi-modal Matching). In particular, unlike prior studies that rely solely on one-to-one matching of image-text pairs, 3M incorporates multiple texts describing the same image to achieve one-to-many alignment. Maximizing mutual information between features extracted from these views helps the model capture high-level factors that influence multiple views, such as the occurrence of specific objects. In addition, exploiting a natural characteristic of fashion data, multi-view images of the same product, such as front-view and side-view shots, are well suited for intra-modal self-alignment. Therefore, 3M also introduces an intra-modal contrastive objective that provides additional benefits for representation learning from the image perspective. To the best of our knowledge, our framework is the first to consider one-to-many mapping for multi-modality representation learning.
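The abstract does not give the exact loss formulation, but the two objectives it describes, one-to-many image-text alignment and intra-modal image-image contrast, can be sketched as InfoNCE-style losses. The following is a minimal illustration, assuming a batch where each image comes with K captions and a second view of the same product; the function names and temperature value are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def one_to_many_contrastive_loss(image_emb, text_embs, temperature=0.07):
    """One-to-many image-text alignment (sketch of the 3M idea).

    image_emb: (B, D) -- one embedding per image
    text_embs: (B, K, D) -- K captions describing each image
    Each image is pulled toward all K of its own captions and pushed
    away from every caption of the other images in the batch.
    """
    B, K, D = text_embs.shape
    img = F.normalize(image_emb, dim=-1)                    # (B, D)
    txt = F.normalize(text_embs.reshape(B * K, D), dim=-1)  # (B*K, D)
    logits = img @ txt.t() / temperature                    # (B, B*K)
    # positive mask: image i matches captions i*K .. i*K + K - 1
    pos = torch.zeros_like(logits, dtype=torch.bool)
    for i in range(B):
        pos[i, i * K:(i + 1) * K] = True
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood over the K positives of each image
    return -(log_prob[pos].reshape(B, K)).mean()

def intra_modal_loss(view_a, view_b, temperature=0.07):
    """Intra-modal image-image contrast: two views of the same product
    (e.g. front and side) are positives; other products are negatives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```

In practice the two losses would be combined with the usual one-to-one image-text objective and weighted; the weighting scheme is not specified in the abstract.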
Experimental evaluations demonstrate that our fashion experts are competitive and achieve state-of-the-art performance, bringing a +3.47% R@10 boost on Fashion-200K and +1.98% R@10 boost on the Fashion-IQ dress dataset compared to the previous SOTA results.
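The four-stage control loop of the system paradigm (query understanding, model selection by function description, subtask execution, and response summarization) can be sketched as follows. This is a hypothetical illustration: the `plan`, `match_score`, `run`, and `summarize` interfaces are assumed, not the authors' actual API.

```python
def fashion_gpt_turn(llm, expert_models, history, user_query):
    """One round of the multi-round, multi-modal search loop:
    1. the LLM parses the user query into subtasks,
    2. an expert model is picked for each subtask by how well its
       function description matches the subtask,
    3. the selected experts are executed,
    4. the LLM summarizes the results into a reply.
    All method names below are illustrative assumptions."""
    # 1. task planning, e.g. [{"task": "image+text retrieval"}]
    plan = llm.plan(history=history, query=user_query)
    results = []
    for subtask in plan:
        # 2. model selection by function description
        expert = max(expert_models,
                     key=lambda m: m.match_score(subtask["task"]))
        # 3. subtask execution with the chosen fashion expert
        results.append(expert.run(subtask))
    # 4. response generation from the execution results
    return llm.summarize(history=history, query=user_query,
                         results=results)
```

The conversation history is threaded through both LLM calls, which is what makes the search multi-round rather than single-shot.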
Keywords
ChatGPT-based system with retrieval function, multimodal pre-training network, multi-round multi-modal search