Long-term Recurrent Convolutional Networks for Visual Recognition and Description
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 4, 2017, Pages 677-691.
Abstract:
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visu...
Introduction
- Recognition and description of images and videos is a fundamental challenge of computer vision.
- CNN models for video processing have successfully considered learning of 3-D spatio-temporal filters over raw sequence data [13, 2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [16, 33]
- Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling.
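The LRCN approach described in the abstract sits between these two extrema: a convolutional network encodes each frame and a recurrent network (an LSTM) models the temporal dynamics of the resulting feature sequence. The sketch below is a minimal PyTorch illustration of that pattern, not the authors' implementation; the ResNet-18 backbone, hidden size, and 101-class output head are placeholder assumptions (the original work used Caffe-based CNN architectures).

```python
# Minimal sketch of a CNN+LSTM ("LRCN-style") video classifier in PyTorch.
# Assumptions not taken from the paper: ResNet-18 frame encoder, hidden size,
# and class count are illustrative placeholders.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNLSTMClassifier(nn.Module):
    def __init__(self, num_classes=101, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)      # per-frame visual encoder
        feat_dim = backbone.fc.in_features            # 512 for ResNet-18
        backbone.fc = nn.Identity()                   # keep pooled features only
        self.cnn = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))         # (b*t, feat_dim)
        feats = feats.view(b, t, -1)                  # (b, t, feat_dim)
        outputs, _ = self.lstm(feats)                 # one hidden state per frame
        logits = self.classifier(outputs)             # per-frame class scores
        return logits.mean(dim=1)                     # average predictions over time


model = CNNLSTMClassifier()
scores = model(torch.randn(2, 8, 3, 224, 224))        # two clips of 8 frames each
print(scores.shape)                                    # torch.Size([2, 101])
```

Averaging the per-time-step predictions in the last line of `forward` roughly mirrors how the paper's activity recognition scores are obtained from per-frame LSTM outputs at test time.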
Highlights
- Recognition and description of images and videos is a fundamental challenge of computer vision
- We show here that long-term recurrent convolutional models are generally applicable to visual time-series modeling; we argue that in visual tasks where static or flat temporal models have previously been employed, long-term RNNs can provide significant improvement when ample training data are available to learn or refine the representation
- Traditional recurrent neural networks (RNNs; Figure 2, left) can learn complex temporal dynamics by mapping input sequences to a sequence of hidden states, and hidden states to outputs, via the recurrence equations h_t = g(W_xh x_t + W_hh h_{t-1} + b_h) and z_t = g(W_hz h_t + b_z), where g is an element-wise non-linearity such as a sigmoid or hyperbolic tangent, x_t is the input, h_t ∈ R^N is the hidden state with N hidden units, and z_t is the output at time t (a minimal NumPy sketch of this recurrence follows at the end of this list)
- We’ve presented long-term recurrent convolutional networks, a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs
- Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve on previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence
- The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to produce with little input preprocessing and no hand-designed features
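For reference, the RNN recurrence quoted in the Highlights above can be written out directly. The following is a minimal NumPy sketch of h_t = g(W_xh x_t + W_hh h_{t-1} + b_h) and z_t = g(W_hz h_t + b_z); the dimensions and randomly initialized weights are illustrative placeholders.

```python
# Minimal NumPy sketch of the vanilla RNN recurrence:
#   h_t = g(W_xh x_t + W_hh h_{t-1} + b_h),   z_t = g(W_hz h_t + b_z)
import numpy as np


def rnn_forward(xs, W_xh, W_hh, W_hz, b_h, b_z, g=np.tanh):
    """Run the recurrence over a sequence xs of shape (T, input_dim)."""
    N = W_hh.shape[0]
    h = np.zeros(N)                          # h_0 = 0
    outputs = []
    for x_t in xs:
        h = g(W_xh @ x_t + W_hh @ h + b_h)   # hidden state update
        z_t = g(W_hz @ h + b_z)              # output at time t
        outputs.append(z_t)
    return np.stack(outputs)                 # (T, output_dim)


T, input_dim, N, output_dim = 5, 3, 4, 2     # placeholder sizes
rng = np.random.default_rng(0)
zs = rnn_forward(
    rng.standard_normal((T, input_dim)),
    rng.standard_normal((N, input_dim)),     # W_xh
    rng.standard_normal((N, N)),             # W_hh
    rng.standard_normal((output_dim, N)),    # W_hz
    np.zeros(N), np.zeros(output_dim),
)
print(zs.shape)  # (5, 2)
```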
Results
- The authors evaluate the image description model for retrieval and generation tasks. The authors first show the effectiveness of the model by quantitatively evaluating it on the image and caption retrieval tasks proposed by [26] and seen in [25, 15, 36, 8, 18].
- The results show that (1) the LSTM outperforms an SMT-based approach to video description; (2) the simpler decoder architectures (b) and (c) achieve better performance than (a), likely because the input does not need to be memorized; and (3) the approach achieves a BLEU@4 score of 28.8%, clearly outperforming the best previously reported result of 26.9% on TACoS multilevel by [29].
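The 28.8% vs. 26.9% comparison above is a corpus-level BLEU@4 score over generated video descriptions. As a hedged illustration (not the paper's evaluation script), the snippet below computes corpus BLEU@4 with NLTK's corpus_bleu; the reference and candidate sentences are made-up placeholders.

```python
# Hedged sketch of corpus-level BLEU@4 scoring for generated descriptions.
# The sentences are made-up placeholders, not data from the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["the", "person", "cuts", "the", "cucumber"],
     ["a", "person", "slices", "a", "cucumber"]],      # multiple references per video
]
hypotheses = [
    ["the", "person", "slices", "the", "cucumber"],    # generated description
]

bleu4 = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),                  # standard 4-gram BLEU
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU@4: {100 * bleu4:.1f}%")
```

When each video has several human-written descriptions, corpus_bleu accepts all of them as references for the corresponding hypothesis, which is how multi-reference datasets are typically scored.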
Conclusion
- The authors have presented LRCN, a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.
- The authors' results consistently demonstrate that by learning sequential dynamics with a deep sequence model, the authors can improve on previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.
- The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to produce with little input preprocessing and no hand-designed features
Tables
- Table1: Activity recognition: Comparing single frame models to LRCN networks for activity recognition in the UCF-101 [37] dataset, with both RGB and flow inputs. Values for split-1 as well as the average across all three splits are shown. Our LRCN model consistently and strongly outperforms a model based on predictions from the underlying convolutional network architecture alone. On split-1, we show that placing the LSTM on fc6 performs better than fc7
- Table2: Image description: retrieval results for the Flickr30k [28] and COCO 2014 [24] datasets. R@K is the average recall at rank K (higher is better). Medr is the median rank (lower is better). Note that [18] achieves better retrieval performance using a stronger CNN architecture; see text
- Table3: Image description: Sentence generation results (BLEU scores (%) – ours are adjusted with the brevity penalty) for the Flickr30k [28] and COCO 2014 [24] test sets
- Table4: Image description: Human evaluator rankings from 1-6 (low is good) averaged for each method and criterion. We evaluated on 785 Flickr images selected by the authors of [18] for the purposes of comparison against this similar contemporary approach
- Table5: Video description: Results on detailed description of TACoS multilevel [29], in %; see Section C.3 for details
Funding
- This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center
- The GPUs used for this research were donated by the NVIDIA Corporation. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD)
Reference
- M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Action classification in soccer videos with long short-term memory recurrent neural networks. In ICANN. 2010. 4, 5
- M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding. 2011. 2, 4, 5
- A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video in sentences out. In UAI, 2012. 7
- T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV. 2005
- K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259, 2014. 2, 3, 7
- P. Das, C. Xu, R. Doell, and J. Corso. Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013. 7
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 5
- A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013. 6, 7
- A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 3
- A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014. 2, 3
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. 7
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997. 2, 3
- S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231, 2013. 2, 4
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014. 2, 5, 6
- A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NIPS, 2014. 6, 7
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2, 4, 5, 12
- M. U. G. Khan, L. Zhang, and Y. Gotoh. Human focused video description. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011. 7
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. 6, 7
- R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In ICML, 2014. 6, 15
- R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In Proc. NIPS Deep Learning Workshop, 2013. 6
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In ACL, 2007. 7, 8
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 4, 5, 6
- P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014. 7
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014. 6, 7, 13, 16
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014. 6, 7
- M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013. 6
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002. 6
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014. 6, 7
- A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, 2014. 8, 18, 20
- M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013. 2, 7, 8
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985. 2
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014. 5, 6
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, 2014. 2, 4, 5, 12
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 4
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 6
- R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013. 6, 7
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5, 6, 14
- I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011. 2
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 2, 3, 7
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014. 4
- C. C. Tan, Y.-G. Jiang, and C.-W. Ngo. Towards textually describing complex video contents with audio-visual concept classifiers. In MM, 2011. 7
- J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, 2014. 7
- O. Vinyals, S. V. Ravuri, and D. Povey. Revisiting recurrent neural networks for robust ASR. In ICASSP, 2012. 2
- H. Wang, A. Klaser, C. Schmid, and C. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 2013. 8
- R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989. 2
- W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014. 3
- W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. 2, 4
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. 2014. 5