Deep Reinforcement Learning-based Image Captioning with Embedding Reward

CVPR, 2017.

Abstract:

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates ...

Introduction
  • Image captioning, the task of automatically describing the content of an image in natural language, has attracted increasing interest in computer vision.
  • The value network, which evaluates the reward value of all possible extensions of the current state, serves as a global and lookahead guidance.
  • This value network shifts the goal from predicting the correct next word toward generating captions that are similar to the ground-truth captions.
  • The two networks complement each other when choosing each word during generation (see the sketch after this list).
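To make the division of labor concrete, the following is a minimal PyTorch-style sketch of the two interfaces, assuming a NIC-like recurrent policy and an RNN + MLP value head; all class names, dimensions, and the `visual_feat` input are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Local guidance: log p(next word | image, words generated so far). Hypothetical sketch."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)        # image feature -> RNN input space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, word_ids):
        # Feed the image first, then the partial caption; score the next word.
        img = self.img_proj(visual_feat).unsqueeze(1)               # (B, 1, E)
        words = self.word_embed(word_ids)                           # (B, T, E)
        h, _ = self.rnn(torch.cat([img, words], dim=1))             # (B, T+1, H)
        return torch.log_softmax(self.out(h[:, -1]), dim=-1)        # (B, vocab)

class ValueNetwork(nn.Module):
    """Global guidance: scalar value of a state = (image, partial caption). Hypothetical sketch."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(feat_dim + hidden_dim, 512),
                                 nn.ReLU(),
                                 nn.Linear(512, 1))

    def forward(self, visual_feat, word_ids):
        _, (h_n, _) = self.rnn(self.word_embed(word_ids))           # encode the partial caption
        state = torch.cat([visual_feat, h_n[-1]], dim=-1)           # fuse image + caption encodings
        return self.mlp(state).squeeze(-1)                          # (B,) estimated future reward
```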
Highlights
  • Image captioning, the task of automatically describing the content of an image in natural language, has attracted increasing interest in computer vision
  • To learn our policy and value networks, we introduce an actor-critic reinforcement learning algorithm driven by visual-semantic embedding
  • We propose a novel lookahead inference mechanism that combines the local guidance of policy network and the global guidance of value network
  • We propose a lookahead inference that combines the policy network and value network to consider every candidate word in W_{t+1} at each step (see the scoring sketch after this list)
  • We present a novel decision-making framework for image captioning, which achieves state-of-the-art performance on the standard benchmark
  • The policy network serves as a local guidance and the value network serves as a global and lookahead guidance
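As a rough picture of this lookahead inference, the sketch below runs one step of a beam search in which each candidate extension is scored by mixing the policy's local log-probability with the value network's estimate of the extended state. The mixing weight `lam` and the exact scoring rule here are illustrative assumptions (the paper defines the precise formulation and studies the hyperparameter λ in Table 3); `policy_net` and `value_net` refer to the hypothetical modules sketched above.

```python
import torch

def lookahead_beam_step(policy_net, value_net, visual_feat, beams, lam=0.5, beam_size=10):
    """One step of lookahead inference (illustrative, not the paper's exact algorithm).

    beams: list of (word_ids LongTensor of shape (1, t), cumulative_score).
    Each candidate word w is scored by
        cumulative_score + lam * log p(w | state) + (1 - lam) * V(state extended by w).
    """
    candidates = []
    for word_ids, score in beams:
        log_p = policy_net(visual_feat, word_ids).squeeze(0)       # (vocab,) local guidance
        top_logp, top_w = log_p.topk(beam_size)                    # prune before evaluating V
        for lp, w in zip(top_logp.tolist(), top_w.tolist()):
            extended = torch.cat([word_ids, torch.tensor([[w]])], dim=1)
            v = value_net(visual_feat, extended).item()            # global, lookahead guidance
            candidates.append((extended, score + lam * lp + (1.0 - lam) * v))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]                                  # keep the best beams
```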
Methods
  • DCC [13] utilized external data to demonstrate its unique transfer capacity.
  • This makes its results not directly comparable to methods that do not use extra training data.
  • Since the policy network shown in Figure 2 is based on a mechanism similar to the basic image captioning model of Google NIC [44], the significant improvement over [44] validates the effectiveness of the proposed decision-making framework that utilizes both policy and value networks.
  • Other powerful mechanisms, such as spatial attention and semantic attention, can be directly integrated into the policy network to further improve performance (a generic attention sketch follows this list)
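For example, a generic soft spatial-attention layer of the kind popularized for captioning could replace the single image-vector input of a policy network like the one sketched earlier, letting the decoder re-weight image regions at every step. This is an illustrative sketch only (module name, dimensions, and the region-feature input are assumptions), not part of the authors' model.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Generic soft attention over CNN feature-map regions (illustrative only)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, rnn_hidden):
        # region_feats: (B, R, feat_dim) spatial regions; rnn_hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(region_feats) + self.hidden_proj(rnn_hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(e, dim=1)                     # attention weights over regions
        return (alpha * region_feats).sum(dim=1)            # (B, feat_dim) attended context
```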
Results
  • Extensive experiments on the Microsoft COCO dataset [29] show that the proposed method outperforms state-of-the-art approaches consistently across different evaluation metrics, including BLEU [34], METEOR [25], ROUGE [28], and CIDEr [42] (a scoring sketch using standard tooling follows this list).
  • The authors' method achieves state-of-the-art performance on the MS COCO dataset
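These metrics are typically computed with the `pycocoevalcap` package from the COCO caption evaluation server [2]; the snippet below is a minimal usage sketch, assuming the package (and Java, which METEOR requires) is installed, with made-up example captions.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of tokenized caption strings.
gts = {1: ["a man is riding a horse on the beach", "a person rides a horse near the ocean"]}
res = {1: ["a man riding a horse on the beach"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)   # BLEU returns a list of BLEU-1..4 scores
    print(name, score)
```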
Conclusion
  • The authors present a novel decision-making framework for image captioning, which achieves state-of-the-art performance on the standard benchmark.
  • The policy network serves as a local guidance and the value network serves as a global and lookahead guidance.
  • To learn both networks, the authors use an actor-critic reinforcement learning approach with novel visual-semantic embedding rewards (a schematic update is sketched after this list).
  • Future work includes improving the network architectures and investigating the reward design by considering other embedding measures.
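To make the training signal concrete: in a visual-semantic embedding of the kind the paper draws on [19, 36, 37], the image and the generated caption are mapped into a shared space and their similarity acts as the reward, with the value network serving as a baseline for the policy gradient. The code below is a schematic single update under those assumptions; all names are illustrative, and the exact losses, sampling scheme, and embedding training follow the paper, not this sketch.

```python
import torch
import torch.nn.functional as F

def embedding_reward(img_embed, caption_embed):
    """Similarity of image and generated caption in a learned joint embedding space
    (schematic; the actual embedding is trained separately with a ranking loss)."""
    return F.cosine_similarity(img_embed, caption_embed, dim=-1)       # (B,)

def actor_critic_step(policy_net, value_net, visual_feat, sampled_ids, log_probs,
                      img_embed, caption_embed, policy_opt, value_opt):
    """One schematic advantage actor-critic update (illustrative, not the authors' exact losses).

    sampled_ids: caption sampled from the policy, shape (B, T)
    log_probs:   sum over time of log pi(w_t | state_t) for the sampled caption, shape (B,)
    """
    reward = embedding_reward(img_embed, caption_embed).detach()       # terminal reward, no gradient
    value = value_net(visual_feat, sampled_ids)                        # baseline V(state), shape (B,)

    advantage = (reward - value).detach()                              # how much better than expected
    policy_loss = -(advantage * log_probs).mean()                      # policy gradient with baseline
    value_loss = F.mse_loss(value, reward)                             # regress V toward the reward

    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
    return reward.mean().item()                                        # average reward for logging
```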
Tables
  • Table1: Performance of our method on the MS COCO dataset, compared with state-of-the-art methods. Our beam size is set to 10. For the competing methods, we show the results from the latest version of their papers. Numbers in bold face are the best known results and (−) indicates an unknown score. (∗) indicates that external data was used for training in these methods
  • Table2: Performance of the variants of our method on MS COCO dataset, with beam size = 10. SL: supervised learning baseline. SL-
  • Table3: Evaluation of hyperparameter λ’s impact on our method
  • Table4: Evaluation of different beam sizes’ impact on SL baseline and our method
Related work
  • 2.1. Image captioning

    Many image captioning approaches have been proposed in the literature. Early approaches tackled this problem using a bottom-up paradigm [10, 23, 27, 47, 24, 8, 26, 9], which first generated descriptive words of an image by object recognition and attribute prediction, and then combined them with language models. Recently, inspired by the successful use of neural networks in machine translation [4], the encoder-decoder framework [3, 44, 30, 17, 7, 46, 15, 48, 43] has been brought to image captioning. Researchers adopted this framework because "translating" an image into a sentence is analogous to the task in machine translation. Approaches following this framework generally encoded an image as a single feature vector with convolutional neural networks [22, 6, 39, 41], and then fed this vector into a recurrent neural network [14, 5] to generate captions. On top of this framework, various modeling strategies have been developed. Karpathy and Fei-Fei [17] and Fang et al. [9] presented methods that enhance their models by detecting objects in images. To mimic the visual system of humans [20], spatial attention [46] and semantic attention [48] were proposed to direct the model to attend to meaningful fine details. Dense captioning [16] was proposed to handle the localization and captioning tasks simultaneously. Ranzato et al. [35] proposed a sequence-level training algorithm.
References
  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
  • [2] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
  • [3] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
  • [4] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • [5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [8] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
  • [9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
  • [10] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
  • [11] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.
  • [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
  • [15] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars. Guiding the long-short term memory model for image caption generation. In ICCV, 2015.
  • [16] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
  • [17] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • [18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [19] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In TACL, 2015.
  • [20] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Matters of Intelligence, pages 115–141, 1987.
  • [21] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In NIPS, 1999.
  • [22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [23] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
  • [24] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
  • [25] A. Lavie and M. Denkowski. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 2010.
  • [26] R. Lebret, P. O. Pinheiro, and R. Collobert. Simple image description generator via a linear phrase-based approach. In ICLR, 2015.
  • [27] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011.
  • [28] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In WAS, 2004.
  • [29] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollar. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [30] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks. In ICLR, 2015.
  • [31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • [32] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • [33] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
  • [34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • [35] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
  • [36] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Multi-instance visual-semantic embedding. arXiv:1512.06963, 2015.
  • [37] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Joint image-text representation by Gaussian visual-semantic embedding. In ACM Multimedia, 2016.
  • [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [40] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
  • [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [42] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
  • [43] A. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv:1610.02424, 2016.
  • [44] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • [45] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • [46] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [47] Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
  • [48] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
  • [49] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, and A. Gupta. Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143, 2016.