Learning Cross-Modal Embeddings for Cooking Recipes and Food Images.

CVPR, pp. 3068–3076, 2017.

Abstract

In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an im2recipe retrieval task.

Introduction
  • There are few things so fundamental to the human experience as food. Its consumption is intricately linked to our health, our feelings, and our culture.
  • Even migrants starting a new life in a foreign country often hold on to their ethnic food longer than to their native language.
  • Vital as it is to our lives, food offers new perspectives on topical challenges in computer vision, like finding representations that are robust to occlusion and deformation.
  • Far beyond applications solely in the realm of culinary arts, such a recipe-understanding tool may be applied to the plethora of food images shared on social media to achieve insight into the significance of food and its preparation for public health.
Highlights
  • There are few things so fundamental to the human experience as food
  • The profusion of online recipe collections with user-submitted photos presents the possibility of training machines to automatically understand food preparation by jointly analyzing ingredient lists, cooking instructions, and food images
  • We evaluate Canonical Correlation Analysis (CCA) over mean ingredient word2vec and skip-instructions features as another baseline
  • We report median rank (MedR) and recall rate at top K (R@K) for all the retrieval experiments
  • We find that ingredient detectors emerge in different units in our embeddings, which are aligned across modalities
  • We present Recipe1M, the largest structured recipe dataset to date, the im2recipe problem, and neural embedding models with semantic regularization which achieve impressive results for the im2recipe task
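The "semantic regularization" above is the paper's key training ingredient: a cosine similarity loss aligns recipe and image embeddings, while a down-weighted classification objective encourages both embeddings to predict a shared food class. The following is a minimal PyTorch sketch of that idea under stated assumptions, not the authors' exact architecture: branch inputs are taken to be precomputed recipe and image features, and the layer sizes, class count, margin, and weight lam are all illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two-branch model mapping recipe and image features into a
    shared embedding space (all dimensions here are assumptions)."""
    def __init__(self, recipe_dim=2048, image_dim=2048,
                 embed_dim=1024, num_classes=1000):
        super().__init__()
        self.recipe_proj = nn.Linear(recipe_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        # Shared classifier head, used only as a semantic regularizer.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, recipe_feat, image_feat):
        v_r = F.normalize(self.recipe_proj(recipe_feat), dim=-1)
        v_i = F.normalize(self.image_proj(image_feat), dim=-1)
        return v_r, v_i

def joint_loss(model, v_r, v_i, match, cls_r, cls_i, lam=0.02):
    # Alignment term: pull matched recipe-image pairs together and
    # push mismatched ones apart (match holds +1 or -1 per pair).
    align = F.cosine_embedding_loss(v_r, v_i, match, margin=0.1)
    # Semantic term: classify both embeddings into shared food
    # classes; lam keeps it from dominating the alignment objective.
    sem = (F.cross_entropy(model.classifier(v_r), cls_r)
           + F.cross_entropy(model.classifier(v_i), cls_i))
    return align + lam * sem
```

Keeping lam small mirrors the Objectives note later in this summary: the authors deliberately limit the effect of semantic regularization, since classification is not the main problem they aim to solve.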
Methods
  • The authors begin with the evaluation of the learned embeddings for the im2recipe retrieval task.
  • The authors study the effect of each component of the model and compare the final system against human performance.
  • The authors evaluate all the recipe representations for im2recipe retrieval.
  • Given a food image, the task is to retrieve its recipe from a collection of test recipes (a minimal sketch of this ranking follows the list).
  • The authors perform recipe2im retrieval using the same setting.
  • All results are reported for the test set.
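Because im2recipe is a ranking task, a small sketch may make the evaluation concrete: given matrices of image and recipe embeddings in which row i of each side belongs to the same test pair, rank all recipes for every image by cosine similarity and record where the true recipe lands. The function and variable names below are my own; only the ranking logic reflects the text.

```python
import numpy as np

def retrieval_ranks(query_emb, target_emb):
    """Return the 1-based rank of the true (same-index) target for
    each query under cosine similarity. im2recipe uses images as
    queries and recipes as targets; recipe2im swaps the arguments."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = q @ t.T                         # (N, N) similarity matrix
    order = np.argsort(-sim, axis=1)      # best match first
    truth = np.arange(len(sim))[:, None]  # true pairs sit on the diagonal
    return np.argmax(order == truth, axis=1) + 1
```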
Conclusion
  • The authors present Recipe1M, the largest structured recipe dataset to date, the im2recipe problem, and neural embedding models with semantic regularization which achieve impressive results for the im2recipe task.
  • The methods presented here could be gainfully applied to other “recipes” like assembly instructions, tutorials, and industrial processes.
  • The authors hope that their contributions will support the creation of automated tools for food and recipe understanding and open doors for many less explored aspects of learning, such as compositional creativity and predicting visual outcomes of action sequences.
Objectives
  • The authors limit the effect of semantic regularization, as it is not the main problem they aim to solve.
Tables
  • Table1: Recipe1M dataset. Number of samples in training, validation and test sets
  • Table2: Main im2recipe retrieval results. Median rank (MedR) and recall rate at top K (R@K) compared against baselines
  • Table3: Ablation studies. Effect of the different model components on median rank (lower is better)
  • Table4: Comparison with human performance on the im2recipe task. Mean results are shown in bold for easier comparison. Note that, on average, the method with semantic regularization performs better than the average AMT worker
Funding
  • This work has been supported by CSAIL-QCRI collaboration projects and the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF).
Study subjects and analysis
recipe-image pairs: 1,000
We adopt the test procedure from the image2caption retrieval task [7, 22]. We report results on a subset of 1,000 randomly selected recipe-image pairs from the test set. We repeat the experiments 10 times and report the mean results.
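As a hedged sketch of this protocol, the loop below subsamples 1,000 pairs, computes median rank and recall at top K, and averages over 10 repetitions; it reuses retrieval_ranks() from the Methods sketch, and the fixed seed is my addition rather than something the paper specifies.

```python
import numpy as np

def evaluate(img_emb, rec_emb, ks=(1, 5, 10), n=1000, repeats=10, seed=0):
    """Sample n recipe-image pairs from the test set, compute median
    rank (MedR) and recall at top K (R@K), repeat, and average."""
    rng = np.random.default_rng(seed)
    medr, rk = [], {k: [] for k in ks}
    for _ in range(repeats):
        idx = rng.choice(len(img_emb), size=n, replace=False)
        ranks = retrieval_ranks(img_emb[idx], rec_emb[idx])
        medr.append(np.median(ranks))
        for k in ks:
            rk[k].append(np.mean(ranks <= k))
    return float(np.mean(medr)), {k: float(np.mean(v)) for k, v in rk.items()}
```

Lower MedR and higher R@K are better; averaging over the 10 random subsets reduces sampling variance.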

References
  • L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014. 1, 2
  • L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016. 5
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013. 6
  • V. R. K. Garimella, A. Alfayad, and I. Weber. Social media image analysis for public health. In CHI, pages 5543–5547, 2016. 1
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 1, 3
  • J.-j. Chen and C.-w. Ngo. Deep-based ingredient recognition for cooking recipe retrieval. ACM Multimedia, 2016. 2
  • A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015. 6
  • Y. Kawano and K. Yanai. FoodCam: A real-time food recognition system on a smartphone. Multimedia Tools and Applications, 74(14):5263–5287, 2015. 1, 2
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. In NIPS, pages 3294–3302, 2015. 3, 6
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1
  • T. Kusmierczyk, C. Trattner, and K. Norvag. Understanding and predicting online food recipe production patterns. In HyperText, 2016. 2
  • Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014. 6
  • C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, and Y. Ma. DeepFood: Deep learning-based food image recognition for computer-aided dietary assessment. In International Conference on Smart Homes and Health Telematics, pages 37–48. Springer, 2016. 1
  • Y. Mejova, S. Abbar, and H. Haddadi. Fetishizing food in digital age: #foodporn around the world. In ICWSM, pages 250–258, 2016. 1
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. 3, 6, 8
  • A. Myers, N. Johnston, V. Rathod, A. Korattikara, A. Gorban, N. Silberman, S. Guadarrama, G. Papandreou, J. Huang, and K. Murphy. Im2Calories: Towards an automated mobile vision food diary. In ICCV, pages 1233–1241, 2015. 1, 2
  • F. Ofli, Y. Aytar, I. Weber, R. Hammouri, and A. Torralba. Is saki #delicious? The food perception gap on Instagram and its relation to health. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 8
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 1
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 1, 3
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014. 3
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015. 6
  • X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso. Recipe recognition with large multimodal food dataset. In ICME Workshops, pages 1–6, 2015. 2
  • R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain. Geolocalized modeling for dish recognition. IEEE Trans. Multimedia, 17(8):1187–1199, 2015. 2
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. International Conference on Learning Representations, 2015. 8
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014. 1