From captions to visual concepts and back

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1473–1482. arXiv: abs/1411.4952.

Links: dblp.uni-trier.de | academic.microsoft.com | arxiv.org

Abstract

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model, and candidate captions are re-ranked using sentence-level features and a deep multimodal similarity model.
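The multiple-instance-learning step can be illustrated with a short sketch: a word detector scores many candidate image regions, and a noisy-OR combination (the formulation named in Table 3) turns per-region word probabilities into a single image-level probability. The region probabilities below are hypothetical placeholders, not outputs of the paper's CNN detectors.

```python
import numpy as np

def noisy_or(region_probs):
    """Combine per-region word probabilities into an image-level probability.

    Under the noisy-OR model used in multiple instance learning, the image
    contains the word if at least one region does:
        P(word | image) = 1 - prod_j (1 - P(word | region_j))
    """
    region_probs = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - region_probs)

# Hypothetical per-region probabilities for the word "dog" from a visual detector.
print(noisy_or([0.05, 0.10, 0.85, 0.02]))  # close to 1: one region fires strongly
print(noisy_or([0.05, 0.10, 0.02, 0.02]))  # stays low: no region fires
```

Because a single high-probability region is enough to push the image-level probability toward one, training only needs image-level word labels derived from the captions, not region-level annotations.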

Introduction
  • When does a machine “understand” an image? One definition is when it can generate a novel caption that summarizes the salient content within an image.
  • This content may include objects that are present, their attributes, or their relations with each other.
  • H. Fang, S. Gupta, F. Iandola, and R. K. Srivastava contributed to this work while doing internships at Microsoft Research.
  • Affiliations: H. Fang: University of Washington; S. Gupta, F. Iandola: University of California at Berkeley; R. K. Srivastava: IDSIA, USI-SUPSI.
Highlights
  • When does a machine “understand” an image? One definition is when it can generate a novel caption that summarizes the salient content within an image
  • In addition to several common sentence features, we introduce a new feature based on a Deep Multimodal Similarity Model (DMSM)
  • We describe the datasets used for testing, followed by an evaluation of our approach for word detection and experimental results on sentence generation
  • The images create a challenging testbed for image captioning since most images contain multiple objects and significant contextual information
  • We provide several baselines for experimental comparison, including two baselines that measure the complexity of the dataset: Unconditioned, which generates sentences by sampling an N-gram language model without knowledge of the visual word detectors; and Shuffled Human, which randomly picks a human-generated caption from another image
  • We use a global deep multimodal similarity model introduced in this paper to re-rank candidate captions (a minimal sketch follows this list)
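As a rough sketch of the re-ranking idea, the snippet below scores candidate captions by cosine similarity between an image embedding and caption embeddings in a shared vector space, which is the core operation of a DMSM-style model. The embeddings and captions here are made-up placeholders, not the paper's trained image and text networks.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors in the shared semantic space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(image_vec, caption_vecs, captions):
    """Order candidate captions by similarity to the image embedding."""
    scored = [(cosine_similarity(image_vec, v), c) for v, c in zip(caption_vecs, captions)]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)

# Hypothetical embeddings; in the paper these come from learned image and text models.
rng = np.random.default_rng(0)
image_vec = rng.normal(size=8)
captions = ["a dog runs on the beach", "a plate of food on a table", "a man riding a horse"]
caption_vecs = [rng.normal(size=8) for _ in captions]
for score, caption in rerank(image_vec, caption_vecs, captions):
    print(f"{score:+.3f}  {caption}")
```

In the full system this similarity is only one of several sentence-level features that are combined (with MERT-trained weights, Table 2) to pick the final caption.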
Results
  • The authors describe the datasets used for testing, followed by an evaluation of the approach for word detection and experimental results on sentence generation.
  • Most of the results are reported on the Microsoft COCO dataset [28, 4].
  • The dataset contains 82,783 training images and 40,504 validation images.
  • The images create a challenging testbed for image captioning since most images contain multiple objects and significant contextual information.
  • The COCO dataset provides 5 human-annotated captions per image.
  • The test annotations are not available, so the authors split the validation set into validation and test sets (a split sketch follows this list)
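A minimal sketch of such a 50/50 split of validation image IDs (with hypothetical IDs standing in for the real COCO identifiers) might look like this:

```python
import random

def split_validation(image_ids, seed=0):
    """Split a list of image IDs into equal-sized validation and test halves."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

# Hypothetical IDs standing in for the 40,504 COCO validation images.
val_ids, test_ids = split_validation(range(40504))
print(len(val_ids), len(test_ids))  # 20252 20252
```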
Conclusion
  • The system trains on images and corresponding captions, and learns to extract nouns, verbs, and adjectives from regions in the image.
  • These detected words guide a language model to generate text that reads well and includes the detected words (a toy decoding sketch follows this list).
  • At the time of writing, the system is state-of-the-art on all 14 official metrics of the COCO image captioning task, equaling or exceeding human performance on 12 of the 14.
  • The authors' generated captions have been judged by humans (Mechanical Turk workers) to be equal to or better than human-written captions 34% of the time.
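The interaction between detected words and the language model can be illustrated with a toy greedy decoder: a tiny hand-written bigram model proposes the next word, and words that were detected but not yet emitted receive a score bonus. This is only an illustrative sketch under assumed probabilities, not the paper's maximum-entropy language model or its beam search.

```python
import math

def generate_caption(detected_words, bigram_logprob, max_len=10, bonus=2.0):
    """Greedy generation sketch: a toy bigram LM proposes the next word, with a
    score bonus for emitting detected words that have not been used yet."""
    remaining = set(detected_words)
    caption = ["<s>"]
    for _ in range(max_len):
        prev = caption[-1]
        candidates = bigram_logprob.get(prev, {})
        if not candidates:
            break
        best = max(candidates, key=lambda w: candidates[w] + (bonus if w in remaining else 0.0))
        if best == "</s>":
            break
        caption.append(best)
        remaining.discard(best)
    return " ".join(caption[1:])

# Hypothetical detections and a tiny hand-written bigram table (log probabilities).
detections = {"dog", "beach"}
lm = {
    "<s>":   {"a": math.log(0.9), "the": math.log(0.1)},
    "a":     {"dog": math.log(0.4), "man": math.log(0.6)},
    "the":   {"beach": math.log(0.5), "park": math.log(0.5)},
    "dog":   {"on": math.log(0.8), "</s>": math.log(0.2)},
    "man":   {"on": math.log(0.8), "</s>": math.log(0.2)},
    "on":    {"the": math.log(1.0)},
    "beach": {"</s>": math.log(1.0)},
    "park":  {"</s>": math.log(1.0)},
}
print(generate_caption(detections, lm))  # e.g. "a dog on the beach"
```

The paper's system instead uses a maximum-entropy language model with the features in Table 1, keeps many candidate sentences, and re-ranks them with the MERT-weighted features in Table 2 rather than committing greedily to one word at a time.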
Tables
  • Table 1: Features used in the maximum entropy language model.
  • Table 2: Features used by MERT.
  • Table 3: Average precision (AP) and Precision at Human Recall (PHR) [4] for words with different parts of speech (NN: Nouns, VB: Verbs, JJ: Adjectives, DT: Determiners, PRP: Pronouns, IN: Prepositions). Results are shown using a chance classifier, full-image classification, and noisy-OR multiple instance learning with AlexNet [21] and VGG [42] CNNs.
  • Table 4: Caption generation performance for seven variants of our system on the Microsoft COCO dataset. We report performance on our held-out test set (half of the validation set), using Perplexity (PPLX), BLEU, and METEOR with 4 randomly selected caption references (a perplexity sketch follows this list). Results from human studies of subjective performance are also shown, with error bars in parentheses. Our final system, “VGG+Score+DMSM+ft”, is “same or better” than human 34% of the time.
  • Table 5: Official COCO evaluation server results on the test set (40,775 images). The first row shows results using 5 reference captions; the second row, 40 references. Human results are reported in parentheses.
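Among the metrics in Table 4, perplexity is the only one defined purely by the language model's per-word probabilities. A minimal sketch, using one common base-2 convention and hypothetical per-word probabilities, is:

```python
import math

def perplexity(word_probs):
    """Perplexity of a caption under a language model (base-2 convention):
    PPLX = 2 ** ( -(1/N) * sum_i log2 p(w_i | w_1..w_{i-1}) )"""
    n = len(word_probs)
    return 2.0 ** (-sum(math.log2(p) for p in word_probs) / n)

# Hypothetical probabilities assigned by a language model to each word of one caption.
print(round(perplexity([0.2, 0.1, 0.4, 0.25, 0.3]), 2))
```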
Related Work
  • There are two well-studied approaches to automatic image captioning: retrieval of existing human-written captions, and generation of novel captions. Recent retrieval-based approaches have used neural networks to map images and text into a common vector representation [43]. Other retrieval-based methods use similarity metrics over predefined image features [15, 36]. Farhadi et al. [12] represent both images and text as linguistically motivated semantic triples, and compute similarity in that space. A similar fine-grained analysis of sentences and images has been done for retrieval in the context of neural networks [19].

    Retrieval-based methods always return well-formed human-written captions, but these captions may not be able to describe new combinations of objects or novel scenes. This limitation has motivated a large body of work on generative approaches, where the image is first analyzed and objects are detected, and then a novel caption is generated. Previous work utilizes syntactic and semantic constraints in the generation process [32, 48, 26, 23, 22, 47], and we compare against prior state of the art in this line of work. We focus on the Midge system [32], which combines syntactic structures using maximum likelihood estimation to generate novel sentences; and compare qualitatively against the Baby Talk system [22], which generates descriptions by filling sentence template slots with words selected from a conditional random field that predicts the most likely image labeling. Both of these previous systems use the same set of test sentences, making direct comparison possible.
References
  • S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  • A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996.
  • A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013.
  • X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • S. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.
  • J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • D. Elliott and F. Keller. Comparing automatic evaluation measures for image description. In ACL, 2014.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, June 2010.
  • A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
  • M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679, 2014.
  • R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
  • P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
  • R. Lau, R. Rosenfeld, and S. Roukos. Trigger-based language models: A maximum entropy approach. In ICASSP, 1993.
  • R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. arXiv preprint arXiv:1502.03671, 2015.
  • S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011.
  • C.-Y. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
  • O. Maron and T. Lozano-Perez. A framework for multiple-instance learning. In NIPS, 1998.
  • T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In ASRU, 2011.
  • M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daume III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
  • A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In ICML, 2007.
  • A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012.
  • F. J. Och. Minimum error rate training in statistical machine translation. In ACL, 2003.
  • V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
  • A. Ratnaparkhi. Trainable methods for surface natural language generation. In NAACL, 2000.
  • A. Ratnaparkhi. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435–455, 2002.
  • Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013.
  • R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726, 2014.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
  • Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
  • B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508, 2010.
  • C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2005.
  • C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
  • C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In CVPR, 2013.