Toward Chinese Food Understanding: a Cross-Modal Ingredient-Level Benchmark

Lanjun Wang, Chenyu Zhang, An-An Liu, Bo Yang, Mingwang Hu, Xinran Qiao, Lei Wang, Jianlin He, Qiang Liu

IEEE Transactions on Multimedia (2024)

Although several food-level benchmarks exist for food-related learning, the lack of fine-grained ingredient annotation significantly impedes progress in food scene understanding. In this study, we focus on Chinese food understanding, which involves fine-grained ingredient detection and cross-modal ingredient retrieval. Specifically, to support studies on Chinese food understanding, we build the first cross-modal ingredient-level dataset, called CMIngre, which contains 8,001 image-text pairs from three different sources, i.e., dishes, recipes, and user-generated content, covering 429 distinct ingredients and 95,290 bounding boxes. Based on CMIngre, we evaluate the performance of traditional CNN-based detection algorithms and transformer-based pre-trained large models for ingredient detection. We also propose baseline methods for the cross-modal ingredient retrieval task in both the end-to-end and two-stage settings. Extensive experiments on CMIngre demonstrate the effectiveness of our proposed methods on food understanding.
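As a point of reference for the cross-modal retrieval task described above, the sketch below illustrates the generic retrieval setup: image and text embeddings in a shared space, ranked by cosine similarity. This is a minimal illustration, not the paper's method; the random embeddings, dimensionality, and function names are all hypothetical stand-ins for the outputs of trained encoders.

```python
import numpy as np

# Hypothetical sketch of cross-modal retrieval by cosine similarity.
# Embeddings here are random stand-ins; a real system would use trained
# image/text encoders producing vectors in a shared space.

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve(query_emb, gallery_embs, top_k=3):
    """Return indices of the top_k gallery items most similar to the query."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q  # cosine similarity between query and each gallery item
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
text_emb = rng.normal(size=128)           # e.g. an ingredient description
image_embs = rng.normal(size=(100, 128))  # e.g. 100 candidate food images
# Plant a near-duplicate so the demo has a known best match at index 42.
image_embs[42] = text_emb + 0.01 * rng.normal(size=128)

print(retrieve(text_emb, image_embs)[0])  # index of the best-matching image
```

In an end-to-end setting, both encoders would be trained jointly (e.g., with a contrastive objective) to produce such a shared space; a two-stage setting might first detect ingredients and then retrieve over ingredient-level representations.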
Keywords
Food-related benchmark, cross-modal food understanding, ingredient detection, cross-modal ingredient retrieval