FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

Jin Linbo
Chen Ben
Wei Yi
Hu Yi
Wang Hao

International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.

Abstract:

In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry. Different from matching in the general domain, fashion matching needs to pay much more attention to the fine-grained information in the fashion images and texts. Pioneer approaches detect the regions of interest (i.e., RoIs) ...

Introduction
  • A great amount of multimedia data has emerged on the Internet.
  • The proposed approaches have achieved promising performance on several downstream tasks, such as cross-modal retrieval [40], image captioning [1] and visual question answering [2].
  • These studies are centered on text and image matching of the general domain.
  • The authors focus on the text and image matching of the fashion industry, which mainly refers to clothing, footwear, accessories, makeup, etc.
Highlights
  • Over the last decade, a great amount of multimedia data has emerged on the Internet
  • The pre-training technique has been successfully applied in Computer Vision (CV) [1, 2] and Natural Language Processing (NLP) [8, 46]
  • We focus on the text and image matching of the fashion industry, which mainly refers to clothing, footwear, accessories, makeup, etc.
  • Each experiment is run three times, and the average performance is shown in Table 1; we observe that FashionBERT with patches and the adaptive loss achieves a significant improvement on the Rank@K metrics
  • We find that once FashionBERT matches the fashion texts and images well, it shifts its attention to the Masked Patch Modeling (MPM) and Masked Language Modeling (MLM) tasks (a masking sketch follows this list)
  • We focus on the text and image matching in cross-modal retrieval of the fashion domain
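To make the MLM/MPM tasks above concrete, the snippet below is a minimal sketch of the masking step both objectives rely on: randomly hide part of the input and ask the model to reconstruct it. The 15% masking ratio, the [MASK] token id, and replacing masked patch features with zero vectors are illustrative assumptions, not necessarily the paper's exact settings.

```python
import numpy as np

MASK_TOKEN_ID = 103   # assumed [MASK] id; the real value depends on the tokenizer vocabulary
MASK_PROB = 0.15      # standard BERT masking ratio, assumed here for illustration

def mask_tokens(token_ids, rng):
    """Mask random token ids for the Masked Language Modeling (MLM) objective."""
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < MASK_PROB
    labels = np.where(mask, token_ids, -100)          # -100 marks positions ignored by the loss
    masked_ids = np.where(mask, MASK_TOKEN_ID, token_ids)
    return masked_ids, labels

def mask_patches(patch_features, rng):
    """Blank out random patch features for the Masked Patch Modeling (MPM) objective."""
    feats = np.array(patch_features, dtype=np.float32)
    mask = rng.random(feats.shape[0]) < MASK_PROB
    targets = feats.copy()                            # the model reconstructs the original features
    feats[mask] = 0.0                                 # masked patches are zeroed (assumption)
    return feats, targets, mask

rng = np.random.default_rng(0)
masked_ids, mlm_labels = mask_tokens([2023, 7922, 2058, 1996, 13325], rng)
masked_feats, mpm_targets, mpm_mask = mask_patches(np.ones((64, 2048)), rng)
```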
Methods
  • The authors briefly revisit the BERT language model, then describe how they extract image features and how FashionBERT jointly models the image and text data.

    The BERT model introduced by [8] is an attention-based bidirectional language model.
  • Any pre-trained image model (e.g., InceptionV3 [36] or ResNeXt-101 [43]) can be selected as the backbone of the patch network.
  • These patches are naturally ordered.
  • Matching Backbone: The concatenation of the text token sequence and the image patch sequence constitutes the FashionBERT input.
  • Similar to BERT, the special token [CLS] and the separator token [SEP] are added at the first position and between the text token sequence and the image patch sequence, respectively (as sketched below)
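As a concrete illustration of this input layout, the sketch below cuts an image into a fixed grid of patches, embeds each patch with a pre-trained backbone, and concatenates the result with the text embeddings behind [CLS] and [SEP]. The 8x8 grid, the ResNet-50 backbone, and the 768-dimensional projection are assumptions for illustration; the paper itself names InceptionV3 and ResNeXt-101 as candidate backbones.

```python
import torch
import torchvision.models as models

# Illustrative backbone (requires a recent torchvision); the paper suggests a
# pre-trained image model such as InceptionV3 or ResNeXt-101 for the patch network.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()            # expose the 2048-d pooled feature
backbone.eval()

patch_proj = torch.nn.Linear(2048, 768)      # project patch features to the BERT hidden size

def image_to_patch_embeddings(image, grid=8):
    """Split a (3, H, W) image into grid x grid patches (natural raster order) and embed them."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = (image
               .unfold(1, ph, ph)             # (3, grid, W, ph)
               .unfold(2, pw, pw)             # (3, grid, grid, ph, pw)
               .permute(1, 2, 0, 3, 4)        # (grid, grid, 3, ph, pw)
               .reshape(-1, c, ph, pw))       # (grid*grid, 3, ph, pw)
    patches = torch.nn.functional.interpolate(patches, size=224)  # resize for the backbone
    with torch.no_grad():
        feats = backbone(patches)             # (grid*grid, 2048)
    return patch_proj(feats)                  # (grid*grid, 768)

def build_input(cls_emb, text_embs, sep_emb, patch_embs):
    """[CLS] + text token embeddings + [SEP] + image patch embeddings, as in the matching backbone."""
    return torch.cat([cls_emb, text_embs, sep_emb, patch_embs], dim=0)
```

Because the patches follow a natural raster order, position information can be attached to the image side in the same way as to the text tokens.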
Results
  • The authors of ViLBERT release their fine-tuned cross-modal retrieval model as well
  • In this experiment, the authors evaluate the released ViLBERT model on the Fashion-Gen test data.
  • ViLBERT-Finetune: In this experiment, based on the pretrained ViLBERT, the authors fine-tune a new cross-modal retrieval model with the Fashion-Gen training data.
  • The authors observe that FashionBERT with patches and the adaptive loss achieves a significant improvement on the Rank@K metrics (computed as sketched after this list)
  • This shows the excellent ability of FashionBERT in fashion text and image matching.
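Rank@K (often written Recall@K) in these results is the fraction of queries whose ground-truth match appears among the top-K retrieved candidates. A minimal sketch, assuming each query has exactly one relevant candidate and that it sits on the diagonal of the score matrix:

```python
import numpy as np

def rank_at_k(score_matrix, k):
    """score_matrix[i, j] is the matching score between query i and candidate j;
    candidate i is assumed to be the ground truth for query i."""
    hits = 0
    for i, scores in enumerate(score_matrix):
        order = np.argsort(-scores)              # candidates sorted by descending score
        rank = int(np.where(order == i)[0][0])   # position of the ground-truth candidate
        hits += rank < k
    return hits / len(score_matrix)

scores = np.random.default_rng(0).normal(size=(100, 100))
print(rank_at_k(scores, k=10))   # chance level is roughly k / number of candidates
```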
Conclusion
  • The authors focus on the text and image matching in cross-modal retrieval of the fashion domain.
  • The authors propose FashionBERT to address the matching issues in the fashion domain.
  • FashionBERT splits images into patches.
  • The main conclusions are: 1) the patch method shows its advantages in matching fashion texts and images, compared with the object-level RoI method; 2) through the adaptive loss, FashionBERT shifts its attention among different tasks during the training procedure (an illustrative weighting sketch follows)
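The adaptive loss mentioned here rebalances the per-task weights during training so that the model keeps shifting its attention toward the tasks it currently learns worse. The sketch below uses one simple weighting rule (weights proportional to each task's current loss, renormalized every step) purely as an illustration of the idea, not the paper's exact derivation; the task names are the MLM, MPM, and text-image alignment objectives described above.

```python
import torch

def adaptive_weights(task_losses):
    """Illustrative rule: weight each task in proportion to its current loss, so
    training attention drifts toward the tasks that are learned worse."""
    detached = torch.stack([loss.detach() for loss in task_losses])
    return detached / detached.sum()

def total_loss(mlm_loss, mpm_loss, alignment_loss):
    """Weighted sum of the individual task losses with adaptively chosen weights."""
    losses = [mlm_loss, mpm_loss, alignment_loss]
    weights = adaptive_weights(losses)
    return sum(w * l for w, l in zip(weights, losses))

# Toy example: once the alignment loss is small, most weight goes to MLM/MPM.
print(total_loss(torch.tensor(2.1), torch.tensor(1.8), torch.tensor(0.4)))
```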
Summary
  • Introduction:

    A great amount of multimedia data has emerged on the Internet.
  • The proposed approaches have achieved promising performance on several downstream tasks, such as cross-modal retrieval [40], image captioning [1] and visual question answering [2].
  • These studies are centered on text and image matching of the general domain.
  • The authors focus on the text and image matching of the fashion industry, which mainly refers to clothing, footwear, accessories, makeup, etc.
  • Objectives:

    In Equation (5), on the one hand the authors aim to minimize the total weighted loss, and on the other hand they expect FashionBERT to adaptively shift its attention toward the tasks it has not yet learned well.
  • Methods:

    The authors briefly revisit the BERT language model, then describe how they extract image features and how FashionBERT jointly models the image and text data.

    The BERT model introduced by [8] is an attention-based bidirectional language model.
  • Any pre-trained image model (e.g., InceptionV3 [36] or ResNeXt-101 [43]) can be selected as the backbone of the patch network.
  • These patches are naturally ordered.
  • Matching Backbone: The concatenation of the text token sequence and the image patch sequence constitutes the FashionBERT input.
  • Similar to BERT, the special token [CLS] and the separator token [SEP] are added at the first position and between the text token sequence and the image patch sequence, respectively
  • Results:

    The authors of ViLBERT release their fine-tuned cross-modal retrieval model as well.
  • In this experiment, the authors evaluate the released ViLBERT model on the Fashion-Gen test data.
  • ViLBERT-Finetune: In this experiment, based on the pretrained ViLBERT, the authors fine-tune a new cross-modal retrieval model with the Fashion-Gen training data.
  • The authors observe that FashionBERT with patches and the adaptive loss achieves a significant improvement on the Rank@K metrics
  • This shows the excellent ability of FashionBERT in fashion text and image matching.
  • Conclusion:

    The authors focus on the text and image matching in cross-modal retrieval of the fashion domain.
  • The authors propose FashionBERT to address the matching issues in the fashion domain.
  • FashionBERT splits images into patches.
  • The main conclusions are: 1) the patch method shows its advantages in matching fashion texts and images, compared with the object-level RoI method; 2) through the adaptive loss, FashionBERT shifts its attention among different tasks during the training procedure
Tables
  • Table1: Comparison of FashionBERT with the baseline and SOTA pre-trained approaches
  • Table2: Evaluation of patch feature extraction, where “V3” and “RNX” refer to the pre-trained InceptionV3 and ResNeXt-101 models
  • Table3: Evaluation of model size, where L denotes Layer
  • Table4: Evaluation of FashionBERT in fine-tuning. All approaches are tested on the Intel(R) Xeon(R) E5-2650 servers
Related work
  • 4.1 Pre-training

    The pre-training technique has recently been widely adopted in Machine Learning, as it allows a learning model to leverage information from other related tasks.

    The pre-training technique first became popular in CV. Krizhevsky et al. propose AlexNet in 2012 [28], with which they win the 2012 ILSVRC image classification competition [7]. Later on, researchers found that the CNN blocks of AlexNet, pre-trained on ImageNet or other large-scale image corpora, can be treated as general feature extractors and perform well in a variety of downstream tasks [9]. Since then, researchers have proposed more effective CNN-based models and pre-trained them on massive datasets, such as VGG [33], Google Inception [36], and ResNet [43].

Reference
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould and Lei Zhang, 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick and Devi Parikh, 2015. VQA: Visual Question Answering. In Proceedings of the International Conference on Computer Vision.
  • Stephen Boyd and Lieven Vandenberghe, 2004. Convex Optimization. Cambridge University Press, New York, NY, USA.
  • Huizhong Chen, Andrew Gallagher and Bernd Girod, 2012. Describing clothing by semantic attributes. In Proceedings of the European Conference on Computer Vision, 609–623.
  • Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin and Jingren Zhou, 2020. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search. arXiv preprint arXiv:2001.04246.
  • Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang and Guoping Hu, 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv preprint arXiv:1906.08101.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng and Trevor Darrell, 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv preprint arXiv:1310.1531.
  • Fartash Faghri, David J. Fleet, Jamie Ryan Kiros and Sanja Fidler, 2017. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. arXiv preprint arXiv:1707.05612.
  • Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato and Tomas Mikolov, 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Proceedings of Advances in Neural Information Processing Systems.
  • Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Ross Girshick, 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision.
  • David R. Hardoon, Sandor Szedmak and John Shawe-Taylor, 2004. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation, Vol. 16(12), 2639-2664.
  • Hervé Jégou, Matthijs Douze and Cordelia Schmid, 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33(1), 117-128.
  • Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang and Qun Liu, 2019. TinyBERT: Distilling BERT for Natural Language Understanding.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma and Radu Soricut, 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
  • Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu and Xiaodong He, 2018. Stacked Cross Attention for Image-Text Matching. In Proceedings of the European Conference on Computer Vision.
  • Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang and Ming Zhou, 2020. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. In Proceedings of the Association for the Advancement of Artificial Intelligence.
  • Si Liu, Zheng Song, Guangcan Liu, Changsheng Xu, Hanqing Lu and Shuicheng Yan, 2012. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3330–3337.
  • Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang and Xiaoou Tang, 2016. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1096–1104.
  • Jonathan Long, Evan Shelhamer and Trevor Darrell, 2015. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Jiasen Lu, Dhruv Batra, Devi Parikh and Stefan Lee, 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
  • M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg and Tamara L. Berg, 2014. Hipster wars: Discovering elements of fashion styles. In Proceedings of the European Conference on Computer Vision, 472–488.
  • Tae-Kyun Kim, Josef Kittler and Roberto Cipolla, 2007. Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1005-1018.
  • Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya, 2020. Reformer: The Efficient Transformer. In The International Conference on Learning Representations.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein and Fei-Fei Li, 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332.
  • Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Ishan Misra, C. Lawrence Zitnick and Martial Hebert, 2016. Shuffle and learn: unsupervised learning using temporal order verification. In Proceedings of the European Conference on Computer Vision, 527–544.
  • Jeffrey Pennington, Richard Socher and Christopher Manning, 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever, 2018. Improving language understanding by generative pre-training.
  • Jie Shao, Leiquan Wang, Zhicheng Zhao, Fei Su and Anni Cai, 2016. Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval. Neurocomputing, 214:618-628.
  • Karen Simonyan and Andrew Zisserman, 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
  • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei and Jifeng Dai, 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv preprint arXiv:1908.08530.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy and Cordelia Schmid, 2019. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens and Zbigniew Wojna, 2016. Rethinking the Inception Architecture for Computer Vision. arXiv preprint arXiv:1512.00567.
  • Lorenzo Torresani, Martin Szummer and Andrew Fitzgibbon, 2010. Efficient object category recognition using classemes. In Proceedings of the European Conference on Computer Vision, 776–789.
  • Trieu H. Trinh, Minh-Thang Luong and Quoc V. Le, 2019. Selfie: Self-supervised Pretraining for Image Embedding. arXiv preprint arXiv:1906.02940.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin, 2017. Attention Is All You Need. arXiv preprint arXiv:1706.03762.
  • Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei and James Hays, 2019. Composing Text and Image for Image Retrieval – An Empirical Odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li and Xin Fan, 2019. Position Focused Attention Network for Image-Text Matching. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi et al., 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He, 2016. Aggregated Residual Transformations for Deep Neural Networks. arXiv preprint arXiv:1611.05431.
  • Fei Yan and Krystian Mikolajczyk, 2015. Deep Correlation for Matching Images and Text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Artem Babenko and Victor Lempitsky, 2016. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, 2055-2063.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell and Quoc V. Le, 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.