Sequence to Sequence - Video to Text

S Venugopalan
M Rohrbach
R Mooney
T Darrell

Abstract:

Real-world videos often have complex dynamics; and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos.

Introduction
  • Describing visual content with natural language text has recently received increased interest, especially describing images with a single sentence [8, 5, 16, 18, 20, 23, 29, 40].
  • Video description has so far seen less attention despite its important applications in human-robot interaction, video indexing, and describing movies for the blind.
  • While image description handles a variable length output sequence of words, video description has to handle a variable length input sequence of frames.
  • The authors' model is sequence to sequence in the sense that it reads in frames sequentially and outputs words sequentially (a minimal architecture sketch follows this list).
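The sequence-to-sequence design stacks two LSTM layers: the first reads CNN features of the frames, and the second then emits the sentence one word at a time. Below is a minimal, hypothetical PyTorch sketch of that encode-then-decode pattern; it is not the authors' implementation, and the layer sizes, padding scheme, and names are illustrative assumptions rather than the paper's exact configuration.

```python
# Hypothetical PyTorch sketch of an S2VT-style encode-then-decode LSTM stack.
# Not the authors' implementation; dimensions and padding details are illustrative.
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, vocab=10000, embed=512):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden)       # project CNN frame features
        self.word_embed = nn.Embedding(vocab, embed)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)          # runs over frames
        self.lstm2 = nn.LSTM(hidden + embed, hidden, batch_first=True)  # generates words
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids
        B, Tf, _ = frame_feats.shape
        Tw = captions.shape[1]
        frames = self.frame_proj(frame_feats)
        # Encoding stage: read the frames; the word input is zero-padded.
        pad_words = torch.zeros(B, Tf, self.word_embed.embedding_dim)
        h1_enc, state1 = self.lstm1(frames)
        _, state2 = self.lstm2(torch.cat([h1_enc, pad_words], dim=2))
        # Decoding stage: the frame input is zero-padded; previous words are fed
        # in (teacher forcing during training).
        pad_frames = torch.zeros(B, Tw, frames.shape[2])
        h1_dec, _ = self.lstm1(pad_frames, state1)
        words = self.word_embed(captions)
        h2_dec, _ = self.lstm2(torch.cat([h1_dec, words], dim=2), state2)
        return self.out(h2_dec)   # (B, T_words, vocab) next-word logits
```

Variable-length input and output are handled naturally: the encoding stage runs for however many frames the clip has, and at inference decoding continues word by word until an end-of-sentence token is produced.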
Highlights
  • Describing visual content with natural language text has recently received increased interest, especially describing images with a single sentence [8, 5, 16, 18, 20, 23, 29, 40]
  • We report results on three video description corpora, namely the Microsoft Video Description corpus (MSVD) [3], the MPII Movie Description Corpus (MPII-MD) [28], and the Montreal Video Annotation Dataset (M-VAD) [37]
  • While the Microsoft Video Description corpus is based on web clips with short human-annotated sentences, the MPII Movie Description Corpus and the Montreal Video Annotation Dataset contain Hollywood movie snippets with descriptions sourced from script data and audio description
  • This paper proposed a novel approach to video description
  • Our model achieves state-of-the-art performance on the Microsoft Video Description (MSVD) dataset, and outperforms related work on two large and challenging movie-description datasets.
  • Our model significantly benefits from additional data, suggesting that it has a high model capacity, and is able to learn complex temporal structure in the input and output sequences for challenging movie-description datasets.
Methods
  • This section describes the evaluation of the approach. The authors first describe the datasets used, the evaluation protocol (METEOR; an illustrative scoring sketch follows this list), and the details of the models.

  • The authors report results on three video description corpora, namely the Microsoft Video Description corpus (MSVD) [3], the MPII Movie Description Corpus (MPII-MD) [28], and the Montreal Video Annotation Dataset (M-VAD) [37].
  • Together they form the largest parallel corpora with open-domain video and natural language descriptions.
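METEOR is the metric reported in Tables 2, 4, and 5. Purely as an illustration, the snippet below scores a generated sentence against multiple references using NLTK's meteor_score; this is a stand-in rather than the official evaluation code behind the reported numbers, and the example sentences are made up.

```python
# Illustrative METEOR scoring with NLTK (a stand-in, not the official evaluation
# code behind the reported numbers). Requires the NLTK 'wordnet' data package:
#   import nltk; nltk.download('wordnet')
from nltk.translate.meteor_score import meteor_score

def sentence_meteor(references, hypothesis):
    """METEOR of one generated sentence against all of its reference sentences."""
    # Recent NLTK versions expect pre-tokenized input.
    return meteor_score([r.split() for r in references], hypothesis.split())

# Hypothetical example: one MSVD-style clip with multiple human references.
refs = ["a man is playing a guitar", "someone plays the guitar"]
hyp = "a man is playing a guitar"
print(f"METEOR: {sentence_meteor(refs, hyp):.3f}")
```

The numbers in the tables are corpus-level METEOR; a per-sentence snippet like this is only a rough sanity check, not a replacement for the evaluation code cited in the references.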
Results
  • Results and Discussion

    This section discusses the results of the evaluation shown in Tables 2, 4, and 5.
  • Table 2 shows the results on the MSVD dataset.
  • The authors' basic S2VT AlexNet model on RGB video frames achieves 27.9% METEOR and improves over both the basic mean-pooled model in [39] and the VGG mean-pooled model, suggesting that S2VT is a more powerful approach.
  • The S2VT model that uses flow images alone achieves only 24.3% METEOR but, when combined with the RGB VGG model, raises that model's performance from 29.2% to 29.8% (a fusion sketch follows this list).
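The summary does not spell out how the RGB and flow predictions are combined, so the snippet below is only a plausible illustration of such late fusion: a weighted average of the two models' next-word distributions at each decoding step, with a made-up weight.

```python
# Hypothetical late-fusion sketch for combining an RGB and a flow captioning
# model: blend their next-word distributions at each decoding step.
# The weighting below is made up for illustration; the actual combination
# scheme is described in the paper itself, not in this summary.
import numpy as np

def fuse_step(p_rgb, p_flow, alpha=0.7):
    """Blend two next-word distributions (arrays over the vocabulary)."""
    p = alpha * np.asarray(p_rgb) + (1.0 - alpha) * np.asarray(p_flow)
    return p / p.sum()   # renormalize to a valid distribution

# Toy 4-word vocabulary; greedy decoding would pick the fused argmax.
p_rgb = [0.10, 0.60, 0.20, 0.10]
p_flow = [0.05, 0.40, 0.45, 0.10]
print(np.argmax(fuse_step(p_rgb, p_flow)))   # index of the fused best word
```

The intuition matches the reported numbers: flow alone (24.3% METEOR) is weaker than RGB alone (29.2%), but the motion cues it adds are complementary enough to lift the combination to 29.8%.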
Conclusion
  • In contrast to related work, the authors construct descriptions using a sequence-to-sequence model, where frames are first read sequentially and words are generated sequentially.
  • This allows them to handle variable-length input and output while simultaneously modeling temporal structure.
  • The authors' model achieves state-of-the-art performance on the MSVD dataset, and outperforms related work on two large and challenging movie-description datasets.
  • The authors' model significantly benefits from additional data, suggesting that it has a high model capacity, and is able to learn complex temporal structure in the input and output sequences for challenging movie-description datasets.
Tables
  • Table1: Corpus statistics. The number of tokens in all datasets is comparable; however, MSVD has multiple descriptions for each video while the movie corpora (MPII-MD, M-VAD) have a large number of clips with a single description each. Thus, the number of video-sentence pairs in all three datasets is comparable
  • Table2: MSVD dataset (METEOR in %, higher is better)
  • Table3: Percentage of generated sentences that match a sentence of the training set with an edit (Levenshtein) distance of less than 4; an illustrative computation follows this list. All values are reported in percent (%)
  • Table4: MPII-MD dataset (METEOR in %, higher is better)
  • Table5: M-VAD dataset (METEOR in %, higher is better)
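Table 3's novelty check asks how often a generated sentence is a near-copy of a training sentence (Levenshtein distance below 4). The snippet below is an illustrative way to compute such a statistic; whether the distance is taken over words or characters is not stated in this summary, so word-level distance is an assumption here, and the sentences are made up.

```python
# Illustrative computation of a Table-3-style statistic: percentage of generated
# sentences within edit (Levenshtein) distance < 4 of some training sentence.
# Word-level distance is an assumption; the summary does not specify the unit.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]

def percent_near_training(generated, training, max_dist=4):
    near = sum(
        any(levenshtein(g.split(), t.split()) < max_dist for t in training)
        for g in generated
    )
    return 100.0 * near / len(generated)

# Toy example with made-up sentences.
train = ["a man is playing a guitar", "a dog is running"]
gen = ["a man is playing guitar", "a cat sits on a mat"]
print(percent_near_training(gen, train))   # 50.0: only the first is a near-copy
```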
Related work
  • Early work on video captioning considered tagging videos with metadata [1] and clustering captions and videos [14, 25, 42] for retrieval tasks. Several previous methods for generating sentence descriptions [11, 19, 36] used a two-stage pipeline that first identifies the semantic content (subject, verb, object) and then generates a sentence based on a template. This typically involved training individual classifiers to identify candidate objects, actions, and scenes, and then using a probabilistic graphical model to combine the visual confidences with a language model in order to estimate the most likely content (subject, verb, object, scene) in the video, which is then used to generate a sentence. While this simplifies the problem by detaching content identification from surface realization, it requires selecting a set of relevant objects and actions to recognize. Moreover, a template-based approach to sentence generation (see the sketch after this paragraph) is insufficient to model the richness of language used in human descriptions, e.g., which attributes to use and how to combine them effectively to generate a good description. In contrast, our approach avoids the separation of content identification and sentence generation by learning to directly map videos to full human-provided sentences, learning a language model simultaneously conditioned on visual features.
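To make the contrast concrete, the snippet below sketches what the surface-realization stage of such a two-stage pipeline might look like: a hypothetical fixed template filled from a predicted (subject, verb, object) triple. Real systems of this kind were more elaborate (handling scenes, articles, and verb morphology), but the rigidity of the pattern is exactly what the end-to-end approach avoids.

```python
# Hypothetical template-based surface realization, the second stage of the
# two-stage pipelines discussed above. Triple and template are made up.
def realize(subject, verb_ing, object_):
    """Fill one fixed sentence pattern from a predicted (S, V, O) triple."""
    return f"A {subject} is {verb_ing} a {object_}."

# e.g. classifiers and a graphical model select ("person", "riding", "horse")
print(realize("person", "riding", "horse"))   # -> "A person is riding a horse."
```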
Funding
  • We acknowledge support from ONR ATL Grant N00014-11-1-010, DARPA, AFRL, DoD MURI award N000141110688, DEFT program (AFRL grant FA875013-2-0026), NSF awards IIS-1427425, IIS-1451244, and IIS-1212798, and BVLC
  • Raymond and Kate acknowledge support from Google
  • Marcus was supported by the FITweltweit-Program of the German Academic Exchange Service (DAAD)
Reference
  • [1] H. Aradhye, G. Toderici, and J. Yagnik. Video2Text: Learning to annotate video content. In ICDMW, 2009.
  • [2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25–36, 2004.
  • [3] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
  • [4] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
  • [5] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. In CVPR, 2015.
  • [6] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
  • [7] M. Denkowski and A. Lavie. Meteor Universal: Language specific translation evaluation for any target language. In EACL, 2014.
  • [8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [9] G. Gkioxari and J. Malik. Finding action tubes. 2014.
  • [10] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014.
  • [11] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
  • [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
  • [13] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014.
  • [14] H. Huang, Y. Lu, F. Zhang, and S. Sun. A multi-modal clustering method for web videos. In ISCTCS, 2013.
  • [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
  • [16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • [17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • [18] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
  • [19] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, 2013.
  • [20] P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. In TACL, 2014.
  • [21] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014.
  • [24] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [25] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, B. Shaw, A. F. Smeaton, and G. Queenot. TRECVID 2012 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012, 2012.
  • [26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
  • [27] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In GCPR, 2015.
  • [28] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015.
  • [29] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ILSVRC, 2014.
  • [31] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [33] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • [34] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [36] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, 2014.
  • [37] A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070v1, 2015.
  • [38] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
  • [39] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, 2015.
  • [40] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • [41] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2013.
  • [42] S. Wei, Y. Zhao, Z. Zhu, and N. Liu. Multimodal fusion for video search reranking. TKDE, 2010.
  • [43] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv:1502.08029v4, 2015.
  • [44] W. Zaremba and I. Sutskever. Learning to execute. arXiv:1410.4615, 2014.