Translating Videos to Natural Language Using Deep Recurrent Neural Networks

HLT-NAACL, pp. 1494-1504, 2015.

DOI: https://doi.org/10.3115/v1/N15-1173

Abstract:

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure.

Introduction
  • For most people, watching a brief video and describing what happened is an easy task.
  • Previous work has simplified the problem by detecting a fixed set of semantic roles, such as subject, verb, and object (Guadarrama et al., 2013; Thomason et al., 2014), as an intermediate representation
  • This fixed representation is problematic for large vocabularies and leads to oversimplified, rigid sentence templates that are unable to model the complex structure of natural language (a toy illustration of such template-based generation follows this list)
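
    As a toy illustration of the template limitation above (a hypothetical sketch, not code from the paper or the cited systems), a pipeline approach first predicts an SVO triple and then fills a fixed sentence pattern, so any structure beyond the triple is lost:

      # Hypothetical template-based generation from a fixed SVO triple, in the
      # spirit of the pipeline methods cited above (not the authors' code).
      def svo_to_sentence(subject: str, verb: str, obj: str) -> str:
          # A rigid pattern: adverbs, prepositional phrases, plurals, etc. cannot be expressed.
          return f"A {subject} is {verb} a {obj}."

      print(svo_to_sentence("person", "riding", "motorbike"))
      # -> "A person is riding a motorbike."

    The LSTM approach described in this paper instead generates the sentence word by word conditioned on the video, so it is not restricted to one pattern.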
Highlights
  • For most people, watching a brief video and describing what happened is an easy task
  • Progress in open-domain video description has been difficult in part due to large vocabularies and the scarcity of described video training data
  • We present four main Long Short-Term Memory (LSTM) models (a minimal sketch of the basic architecture follows this list)
  • With regard to the generation metrics BLEU and METEOR, training on full sentences helps the Long Short-Term Memory model develop fluency and vocabulary similar to that seen in the training descriptions and allows it to outperform template-based generation
  • In this paper we have proposed a model for video description which uses neural networks for the entire pipeline from pixels to sentences and can potentially allow for the training and tuning of the entire network
  • We showed that exploiting image description data improves performance compared to relying only on video description data
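
    The sketch below illustrates the basic idea behind the LSTM models listed above: CNN features are extracted for each sampled frame, mean-pooled into a single video vector, and an LSTM decoder generates the sentence conditioned on that vector. This is a minimal PyTorch approximation only; the authors' implementation is Caffe-based, and the layer sizes, names, and the exact way the visual feature is fed to the LSTM here are illustrative assumptions.

      import torch
      import torch.nn as nn

      class MeanPoolLSTMCaptioner(nn.Module):
          def __init__(self, feat_dim=4096, embed_dim=500, hidden_dim=1000, vocab_size=10000):
              super().__init__()
              self.visual_proj = nn.Linear(feat_dim, embed_dim)      # project pooled CNN feature
              self.word_embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
              self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, num_layers=2, batch_first=True)
              self.out = nn.Linear(hidden_dim, vocab_size)           # per-step vocabulary logits

          def forward(self, frame_feats, captions):
              # frame_feats: (batch, n_frames, feat_dim) CNN features, one vector per sampled frame
              # captions:    (batch, seq_len) word indices of the target sentence
              video = self.visual_proj(frame_feats.mean(dim=1))           # mean-pool over frames
              video = video.unsqueeze(1).expand(-1, captions.size(1), -1)
              words = self.word_embed(captions)
              outputs, _ = self.lstm(torch.cat([words, video], dim=-1))   # condition every step on the video
              return self.out(outputs)

    Training would minimize cross-entropy between these logits and the next ground-truth word; at test time the sentence is decoded one word at a time.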
Methods
  • The YouTube video corpus (Chen and Dolan, 2011) comes with several human-generated descriptions in a number of languages; the authors use the roughly 40 available English descriptions per video
  • This dataset has been used in several prior works (Motwani and Mooney, 2012; Krishnamoorthy et al., 2013; Guadarrama et al., 2013; Thomason et al., 2014; Xu et al., 2015) on action recognition and video description tasks.
  • For the task the authors pick 1200 videos for training, 100 for validation, and 670 for testing, as used by the prior works on video description (Guadarrama et al., 2013; Thomason et al., 2014; Xu et al., 2015); the split is sketched after this list
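
    For concreteness, a minimal sketch of that standard 1200/100/670 split (the ordering of video_ids is assumed to follow the convention of the prior work and is not reproduced here):

      # Standard split of the 1970 YouTube clips: 1200 train, 100 validation, 670 test.
      def split_dataset(video_ids):
          assert len(video_ids) == 1970
          train = video_ids[:1200]
          val = video_ids[1200:1300]
          test = video_ids[1300:]
          return train, val, test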
Results
  • Earlier works (Krishnamoorthy et al., 2013; Guadarrama et al., 2013) that reported results on the YouTube dataset compared their methods based on how well they could predict the subject, verb, and object (SVO) depicted in the video.
  • Since these models first predicted the content (SVO triples) and then generated the sentences, SVO accuracy captured the quality of the content generated by the models.
  • The latter evaluation was reported by Xu et al. (2015), so the authors include it here for comparison (the binary SVO accuracy computation is sketched after this list)
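
    A hedged sketch of the binary SVO accuracy used for Table 1 (the "any valid" variant): the predicted subject, verb, or object counts as correct if it matches the corresponding element extracted from any of that video's human descriptions. Dependency-parse extraction of the triples is assumed to have happened upstream, and the data layout below is illustrative.

      def svo_accuracy(predicted, ground_truth):
          # predicted:    {video_id: ("person", "ride", "motorbike")}
          # ground_truth: {video_id: [("man", "ride", "bike"), ("person", "ride", "motorcycle"), ...]}
          correct = [0, 0, 0]  # hits for subject, verb, object
          for vid, pred in predicted.items():
              refs = ground_truth[vid]
              for slot in range(3):
                  if pred[slot] in {triple[slot] for triple in refs}:
                      correct[slot] += 1
          n = len(predicted)
          return [c / n for c in correct]  # per-slot binary accuracy

    The Table 2 variant would instead compare each slot only against the most frequently mentioned S, V, and O among the human descriptions.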
Conclusion
  • The authors note that in the SVO binary accuracy metrics (Tables 1 and 2), the base LSTM model (LSTM-YT) achieves a slightly lower accuracy compared to prior work
  • This is likely due to the fact that previous work explicitly optimizes to identify the best subject, verb and object for a video; whereas the LSTM model is trained on objects and actions jointly in a sentence and needs to learn to interpret these in different contexts.
  • The authors will release the Caffe-based implementation, as well as the model and generated sentences
Tables
  • Table 1: SVO accuracy: Binary SVO accuracy compared against any valid S, V, O triple in the ground-truth descriptions. We extract S, V, O values from the sentences output by our model using a dependency parser. The model is correct if it identifies an S, V, or O mentioned in any one of the multiple human descriptions
  • Table 2: SVO accuracy: Binary SVO accuracy compared against the most frequent S, V, O triple in the ground-truth descriptions. We extract S, V, O values from parses of the sentences output by our model using a dependency parser. The model is correct only if it outputs the most frequently mentioned S, V, O among the human descriptions
  • Table 3: BLEU-4 (combined n-grams 1-4) and METEOR scores from automated evaluation metrics comparing the quality of the generation (a BLEU scoring sketch follows this list). All values are reported as percentages (%)
  • Table 4: Human evaluation mean scores. Sentences were uniquely ranked from 1 to 5 based on their relevance to a given video and rated from 1 to 5 for grammatical correctness. Higher values are better
  • Table 5: BLEU-4 (combined n-grams 1-4) and METEOR scores comparing the quality of sentence generation by the models trained on Flickr30k and COCO and tested on a random frame from the video. LSTM-YT-frame models were fine-tuned on individual frames from the YouTube video dataset. All values are reported as percentages (%)
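
    As a rough illustration of the BLEU-4 scoring reported in Tables 3 and 5, the sketch below uses nltk's corpus_bleu against multiple tokenized reference descriptions per video. This is a stand-in for the exact evaluation scripts used in the paper, so absolute numbers would not match the reported ones.

      from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

      def bleu4(references, hypotheses):
          # references: one list of tokenized human descriptions per video
          # hypotheses: tokenized generated sentences, in the same video order
          return 100 * corpus_bleu(
              references, hypotheses,
              weights=(0.25, 0.25, 0.25, 0.25),  # combined n-grams 1-4
              smoothing_function=SmoothingFunction().method1,
          )

      refs = [[["a", "man", "is", "riding", "a", "motorbike"],
               ["a", "person", "rides", "a", "motorcycle"]]]
      hyps = [["a", "man", "is", "riding", "a", "motorcycle"]]
      print(bleu4(refs, hyps))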
Related work
  • Most of the existing research in video description has focused on narrow domains with limited vocabularies of objects and activities (Kojima et al., 2002; Lee et al., 2008; Khan and Gotoh, 2012; Barbu et al., 2012; Ding et al., 2012; Das et al., 2013b; Das et al., 2013a; Rohrbach et al., 2013; Yu and Siskind, 2013). For example, Rohrbach et al. (2013) and Rohrbach et al. (2014) produce descriptions for videos of several people cooking in the same kitchen. These approaches generate sentences by first predicting a semantic role representation, e.g., modeled with a CRF, of high-level concepts such as the actor, action, and object. They then use a template or statistical machine translation to translate the semantic representation into a sentence.

    Most work on “in-the-wild” online video has focused on retrieval and predicting event tags rather than generating descriptive sentences; examples are tagging YouTube videos (Aradhye et al., 2009) and retrieving online video in the TRECVID competition (Over et al., 2012). Work on TRECVID has also included clustering both video and text features for video retrieval, e.g., (Wei et al., 2010; Huang et al., 2013).
Funding
  • Marcus Rohrbach was supported by a fellowship within the FITweltweit program of the German Academic Exchange Service (DAAD)
  • This research was partially supported by ONR ATL Grant N00014-11-1-010, NSF Awards IIS-1451244 and IIS-1212798
Reference
  • Ahmet Aker and Robert Gaizauskas. 2010. Generating image descriptions using dependency relational patterns. In Association for Computational Linguistics (ACL).
  • H. Aradhye, G. Toderici, and J. Yagnik. 2009. Video2Text: Learning to annotate video content. In IEEE International Conference on Data Mining Workshops (ICDMW).
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
  • Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, and Zhiqi Zhang. 2012. Video in sentences out. In Association for Uncertainty in Artificial Intelligence (UAI).
  • David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Association for Computational Linguistics (ACL).
  • Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • P. Das, R. K. Srihari, and J. J. Corso. 2013a. Translating related words to videos and back through latent topics. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM).
  • P. Das, C. Xu, R. F. Doell, and J. J. Corso. 2013b. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • D. Ding, F. Metze, S. Rawat, P. F. Schulam, S. Burger, E. Younessian, L. Bao, M. G. Christel, and A. Hauptmann. 2012. Beyond audio and video retrieval: Towards multimedia summarization. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (ICMR). ACM.
  • Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2013. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531.
  • Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
  • Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
  • A. Kojima, T. Tamura, and K. Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision (IJCV), 50(2).
  • Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. In AAAI Conference on Artificial Intelligence (AAAI).
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).
  • Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, and Yejin Choi. 2014. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10).
  • M. W. Lee, A. Hakeem, N. Haering, and S. C. Zhu. 2008. SAVE: A framework for semantic annotation of visual events. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312.
  • Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090.
  • Tanvi S. Motwani and Raymond J. Mooney. 2012. Improving video activity recognition using object recognition and text mining. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI).
  • Paul Over, George Awad, Martial Michel, Jonathan Fiscus, Greg Sanders, B. Shaw, Alan F. Smeaton, and Georges Quénot. 2012. TRECVID 2012 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL).
  • Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In IEEE International Conference on Computer Vision (ICCV).
  • Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition (GCPR), September.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS).
  • J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. CoRR, abs/1411.4555.
  • Shikui Wei, Yao Zhao, Zhenfeng Zhu, and Nan Liu. 2010. Multimodal fusion for video search reranking. IEEE Transactions on Knowledge and Data Engineering, 22(8).
  • R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI Conference on Artificial Intelligence (AAAI).
  • B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S. C. Zhu. 2010. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8).
  • Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from videos described with sentences. In Association for Computational Linguistics (ACL).
  • Wojciech Zaremba and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.
  • Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV). Springer.