Sequence to Sequence - Video to Text
Abstract:
Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos.
Introduction
- Describing visual content with natural language text has recently received increased interest, especially describing images with a single sentence [8, 5, 16, 18, 20, 23, 29, 40].
- Video description has so far seen less attention despite its important applications in human-robot interaction, video indexing, and describing movies for the blind.
- While image description handles a variable-length output sequence of words, video description has to handle a variable-length input sequence of frames.
- The authors' model is sequence to sequence in the sense that it reads in frames sequentially and outputs words sequentially; a minimal sketch of such an encoder-decoder is given below.
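The sketch below is a minimal, hedged illustration of this kind of frame-to-word sequence model in PyTorch. It is not the authors' Caffe implementation; the layer sizes, the 4096-dimensional CNN frame features, and the greedy decoding loop are assumptions made only for illustration. It does mimic one design choice from the paper: the same LSTM stack is shared across the encoding and decoding phases, with padded inputs filling whichever channel (frames or words) is inactive.

```python
# Minimal sketch of an S2VT-style sequence-to-sequence captioner (not the
# authors' Caffe implementation). A shared two-layer LSTM first reads CNN
# frame features ("encoding" steps), then keeps running to emit words
# ("decoding" steps). Sizes and the greedy decoder are illustrative.
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden=500, vocab_size=10000, max_words=20):
        super().__init__()
        self.frame_fc = nn.Linear(feat_dim, hidden)       # embed CNN frame features
        self.word_emb = nn.Embedding(vocab_size, hidden)  # embed previous word
        # First LSTM models the frame sequence; the second models the word
        # sequence conditioned on the first LSTM's hidden states.
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)
        self.max_words = max_words

    def forward(self, frame_feats, bos_id=1):
        """frame_feats: (batch, n_frames, feat_dim); returns greedy word ids."""
        b, device = frame_feats.size(0), frame_feats.device

        # Encoding: read frames; the word channel receives zero padding.
        x = self.frame_fc(frame_feats)                    # (b, T, hidden)
        h1, state1 = self.lstm1(x)
        pad = torch.zeros_like(h1)                        # padded word inputs
        _, state2 = self.lstm2(torch.cat([pad, h1], dim=2))

        # Decoding: the frame channel gets padding, the word channel gets words.
        word = torch.full((b, 1), bos_id, dtype=torch.long, device=device)
        outputs = []
        for _ in range(self.max_words):
            pad_frame = torch.zeros(b, 1, x.size(2), device=device)
            h1, state1 = self.lstm1(pad_frame, state1)
            w = self.word_emb(word)                       # (b, 1, hidden)
            h2, state2 = self.lstm2(torch.cat([w, h1], dim=2), state2)
            logits = self.classifier(h2.squeeze(1))       # (b, vocab)
            word = logits.argmax(dim=1, keepdim=True)     # greedy next word
            outputs.append(word)
        return torch.cat(outputs, dim=1)                  # (b, max_words)

# Example: 2 clips with 30 frames of (assumed) 4096-d CNN features each.
caption_ids = S2VTSketch()(torch.randn(2, 30, 4096))
```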
Highlights
- Describing visual content with natural language text has recently received increased interest, especially describing images with a single sentence [8, 5, 16, 18, 20, 23, 29, 40]
- We report results on three video description corpora, namely the Microsoft Video Description corpus (MSVD) [3], the MPII Movie Description Corpus (MPII-MD) [28], and the Montreal Video Annotation Dataset (M-VAD) [37]
- While the Microsoft Video Description corpus is based on web clips with short human-annotated sentences, the MPII Movie Description Corpus and the Montreal Video Annotation Dataset contain Hollywood movie snippets with descriptions sourced from script data and audio description
- This paper proposes a novel approach to video description
- Our model achieves state-of-the-art performance on the MSVD dataset, and outperforms related work on two large and challenging movie-description datasets
- Our model significantly benefits from additional data, suggesting that it has a high model capacity, and is able to learn complex temporal structure in the input and output sequences for challenging movie-description datasets
Methods
- This section describes the evaluation of the approach. The authors first describe the datasets used, the evaluation protocol, and the details of the models; a sketch of METEOR scoring follows this list.
4.1. - The authors report results on three video description corpora, namely the Microsoft Video Description corpus (MSVD) [3], the MPII Movie Description Corpus (MPII-MD) [28], and the Montreal Video Annotation Dataset (M-VAD) [37].
- Together they form the largest parallel corpora with open domain video and natural language descriptions.
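Since all three corpora are scored with METEOR (Tables 2, 4, and 5), the snippet below gives a small, hedged example of how such scoring against multiple references can be computed. It uses NLTK's meteor_score purely as a stand-in; the paper itself relies on the Microsoft COCO evaluation server code, and the toy captions here are made up.

```python
# Hedged sketch: METEOR scoring of a generated caption against several
# references (MSVD provides multiple references per clip). NLTK is used as a
# convenient approximation of the MS COCO evaluation code used in the paper.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet synonym matching
nltk.download("omw-1.4", quiet=True)   # required by newer NLTK WordNet loaders

references = [
    "a man is playing a guitar".split(),
    "someone is strumming a guitar".split(),
]
hypothesis = "a man is playing the guitar".split()

# NLTK expects pre-tokenized token lists and scores against the best reference.
score = meteor_score(references, hypothesis)
print(f"METEOR: {100 * score:.1f}%")   # reported as a percentage, as in Table 2
```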
Results
- Results and Discussion
This section discusses the results of the evaluation shown in Tables 2, 4, and 5.
- Table 2 shows the results on the MSVD dataset.
- The authors' basic S2VT AlexNet model on RGB video frames achieves 27.9% METEOR and improves over the basic mean-pooled model in [39] as well as the VGG mean-pooled model, suggesting that S2VT is a more powerful approach.
- Our S2VT model that uses optical-flow images achieves only 24.3% METEOR on its own, but combining its predictions with those of the VGG model improves the VGG result from 29.2% to 29.8% METEOR; a sketch of this late fusion is given below.
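As a hedged illustration of the fusion just described, the snippet below averages the per-step word distributions of an RGB model and a flow model before picking the next word. The softmax inputs, the tiny vocabulary, and the 0.7/0.3 weighting are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of late fusion of the RGB and optical-flow captioners: at each
# decoding step the two models' word distributions are mixed with a weight
# before choosing the next word.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_step(rgb_logits, flow_logits, w_rgb=0.7):
    """Combine one decoding step's vocabulary scores from both models."""
    p = w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(flow_logits)
    return int(p.argmax())  # greedy choice of the next word id

# Toy example with a 5-word vocabulary.
rng = np.random.default_rng(0)
next_word = fuse_step(rng.normal(size=5), rng.normal(size=5))
print(next_word)
```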
Conclusion
- In contrast to related work, the authors construct descriptions using a sequence to sequence model, where frames are first read sequentially and words are generated sequentially.
- This allows them to handle variable-length input and output while simultaneously modeling temporal structure.
- The authors' model achieves state-of-the-art performance on the MSVD dataset, and outperforms related work on two large and challenging movie-description datasets.
- The authors' model significantly benefits from additional data, suggesting that it has a high model capacity, and is able to learn complex temporal structure in the input and output sequences for challenging movie-description datasets.
Tables
- Table1: Corpus Statistics. The number of tokens is comparable across all datasets; however, MSVD has multiple descriptions for each video, while the movie corpora (MPII-MD, M-VAD) have a large number of clips with a single description each. Thus, the number of video-sentence pairs in all 3 datasets is comparable
- Table2: MSVD dataset (METEOR in %, higher is better)
- Table3: Percentage of generated sentences which match a sentence of the training set with an edit (Levenshtein) distance of less than 4. All values are reported in percent (%); see the sketch after this list
- Table4: MPII-MD dataset (METEOR in %, higher is better)
- Table5: M-VAD dataset (METEOR in %, higher is better)
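The check behind Table 3 can be made concrete with a short, hedged sketch: it computes the share of generated sentences whose edit distance to the nearest training sentence falls below a threshold. Token-level distance and the toy sentences are assumptions; the exact tokenization is not spelled out here.

```python
# Hedged sketch of the novelty check behind Table 3: fraction of generated
# sentences within edit (Levenshtein) distance < 4 of some training sentence.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]

def pct_near_training(generated, training, max_dist=4):
    near = sum(
        any(levenshtein(g.split(), t.split()) < max_dist for t in training)
        for g in generated
    )
    return 100.0 * near / len(generated)

train = ["a man is playing a guitar", "a dog is running in the park"]
gen = ["a man is playing the guitar", "a cat is sitting on a couch"]
print(pct_near_training(gen, train))  # 50.0 in this toy example
```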
Related work
- Early work on video captioning considered tagging videos with metadata [1] and clustering captions and videos [14, 25, 42] for retrieval tasks. Several previous methods for generating sentence descriptions [11, 19, 36] used a two-stage pipeline that first identifies the semantic content (subject, verb, object) and then generates a sentence based on a template. This typically involved training individual classifiers to identify candidate objects, actions, and scenes. These methods then use a probabilistic graphical model to combine the visual confidences with a language model in order to estimate the most likely content (subject, verb, object, scene) in the video, which is then used to generate a sentence. While this simplified the problem by detaching content generation from surface realization, it requires selecting a set of relevant objects and actions to recognize. Moreover, a template-based approach to sentence generation is insufficient to model the richness of language used in human descriptions – e.g., which attributes to use and how to combine them effectively to generate a good description. In contrast, our approach avoids the separation of content identification and sentence generation by learning to directly map videos to full human-provided sentences, learning a language model simultaneously conditioned on visual features. A toy illustration of the two-stage pipeline is sketched below.
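The snippet below is a hedged toy illustration of the two-stage pipeline described above (prior work, not the S2VT model): stage one scores candidate (subject, verb, object) triples from classifier confidences, and stage two fills a fixed sentence template. All scores, labels, and the template are made-up assumptions.

```python
# Toy two-stage pipeline: pick the highest-scoring (subject, verb, object)
# triple, then realize it with a fixed template (illustrative values only).
candidates = {
    ("person", "play", "guitar"): 0.9 * 0.8 * 0.7,  # product of per-word confidences
    ("person", "walk", "dog"):    0.9 * 0.3 * 0.4,
}
subject, verb, obj = max(candidates, key=candidates.get)
sentence = f"A {subject} is {verb}ing a {obj}."      # template-based surface realization
print(sentence)                                      # "A person is playing a guitar."
```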
Funding
- We acknowledge support from ONR ATL Grant N00014-11-1-010, DARPA, AFRL, DoD MURI award N000141110688, DEFT program (AFRL grant FA8750-13-2-0026), NSF awards IIS-1427425, IIS-1451244, and IIS-1212798, and BVLC
- Raymond and Kate acknowledge support from Google
- Marcus was supported by the FITweltweit-Program of the German Academic Exchange Service (DAAD)
Reference
- H. Aradhye, G. Toderici, and J. Yagnik. Video2text: Learning to annotate video content. In ICDMW, 2009. 2
- T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25–36, 2004. 2, 4
- D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011. 2, 5
- X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015. 5
- X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. CVPR, 2015. 1
- K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014. 3
- M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In EACL, 2014. 5
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015. 1, 2, 3, 4
- G. Gkioxari and J. Malik. Finding action tubes. 2014. 4
- A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014. 1
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. 1, 2
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8), 1997. 1, 3
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014. 6
- H. Huang, Y. Lu, F. Zhang, and S. Sun. A multi-modal clustering method for web videos. In ISCTCS. 2013. 2
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. ACMMM, 2014. 2
- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015. 1
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 7
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014. 1
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013. 2
- P. Kuznetsova, V. Ordonez, T. L. Berg, U. C. Hill, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. In TACL, 2014. 1
- C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004. 5
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 6
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014. 1
- J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. CVPR, 2015. 3, 4
- P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, B. Shaw, A. F. Smeaton, and G. Quénot. TRECVID 2012 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012, 2012. 2
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002. 5
- A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. GCPR, 2015. 7
- A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015. 1, 2, 5, 7
- M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013. 1
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ILSVRC, 2014. 4
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014. 2
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 4
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. ICML, 2015. 2
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 1, 2, 3
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015. 6
- J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, 2014. 2, 6
- A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070v1, 2015. 2, 5
- R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CVPR, 2015. 5
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, 2015. 1, 2, 4, 5, 6, 7
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR, 2015. 1, 2, 4
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558. IEEE, 2013. 2
- S. Wei, Y. Zhao, Z. Zhu, and N. Liu. Multimodal fusion for video search reranking. TKDE, 2010. 2
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv:1502.08029v4, 2015. 1, 2, 4, 6, 7, 8
- W. Zaremba and I. Sutskever. Learning to execute. arXiv:1410.4615, 2014. 3