Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

CVPR, pp. 10867-10876, 2020.

Keywords:
graph convolutional network, Toyota Research Institute, Microsoft Video Description Corpus, distillation mechanism, Convolutional Neural Networks

Abstract:

Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions and are sensitive to spurious correlations.

Introduction
  • Scenes are complicated because of the diverse set of entities involved and the complex interactions among them.
  • Current methods for video captioning are not able to capture these interactions.
  • Rather than modeling the correlations among high-level semantic entities, current methods build connections directly on raw pixels and rely on the hierarchical deep neural network structure to capture higher-level relationships [19, 39].
  • Some works operate on object features instead, but they either ignore cross-object interactions [49] or object transformations over time [27, 51].
  • Despite these efforts to model local object features directly, the connections among the objects are not interpretable [27, 51] and are sensitive to spurious correlations.
Highlights
  • Scenes are complicated because of the diverse set of entities involved and the complex interactions among them.
  • We propose an object-aware knowledge distillation mechanism to solve the problem of noisy feature learning that exists in previous spatio-temporal graph models.
  • We first compare our approach against earlier methods, including RecNet [40], which adds a reconstructor on top of the traditional encoder-decoder framework to reconstruct visual features from the generated caption, and PickNet [6], which dynamically attends to frames by maximizing a picking policy.
  • We attribute this to the following reasons: (1) MSR-VTT contains a large portion of animations, on which object detectors generally fail, making it much harder for our proposed spatio-temporal graph to capture object interactions; (2) the two most recent methods, i.e., Wang et al. [39] and Hou et al. [19], directly optimize the decoding part, which generally makes it easier to perform well on language metrics than methods that focus on the encoding part, such as ours; (3) the more advanced features they use (IRv2+I3D optical flow for Wang et al. [39] and IRv2+C3D for Hou et al. [19]) make a direct comparison unfair.
  • We propose a novel spatio-temporal graph network for video captioning that explicitly exploits spatio-temporal object interactions, which are crucial for scene understanding and description.
  • We design a two-branch framework with a proposed object-aware knowledge distillation mechanism, which solves the problem of noisy feature learning present in previous spatio-temporal graph models (a distillation-loss sketch follows this list).
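The distillation mechanism is only described at a high level here, so the following is a minimal PyTorch sketch rather than the paper's exact formulation: it assumes each branch emits per-word vocabulary logits, treats the object branch's softened word distribution as a fixed soft target in the spirit of knowledge distillation [18] and privileged information [26], and uses an illustrative temperature.

```python
import torch
import torch.nn.functional as F

def language_logit_distillation(scene_logits: torch.Tensor,
                                object_logits: torch.Tensor,
                                temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the softened word distributions of the two branches.

    Both tensors are assumed to be (batch, seq_len, vocab_size) pre-softmax
    scores; the temperature of 2.0 is an illustrative value, not from the paper.
    """
    t = temperature
    teacher = F.softmax(object_logits.detach() / t, dim=-1)  # soft targets, no gradient
    student = F.log_softmax(scene_logits / t, dim=-1)
    # Standard soft-target scaling by T^2 as in Hinton et al. [18].
    return F.kl_div(student, teacher, reduction="batchmean") * (t * t)
```

At training time such a term would be added to the usual cross-entropy caption losses of the two branches; only the intent, regularizing scene features with object-level evidence through the language logits, is taken from the paper.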
Methods
  • An overview of the proposed two-branch network architecture is illustrated in Fig. 2.
  • Given a video that depicts a dynamic scene, the goal is to condense it into a representation that fully captures the spatio-temporal object interactions.
  • This is done via the proposed spatio-temporal graph network, which serves as the object branch (a minimal encoder sketch follows this list).
  • [Table 1 excerpt: compared methods include Wang et al. [39], Hou et al. [19], RecNet [40], PickNet [6], OA-BTG [49], and MARN [30].]
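The PyTorch sketch below shows the shape of such an object-branch encoder under stated assumptions: object features detected in each frame become graph nodes, a pre-computed spatial-plus-temporal adjacency matrix is assumed to be given, and a single graph convolution in the style of Kipf and Welling [22] propagates information before pooling. The paper's exact graph construction and layer count are not reproduced here.

```python
import torch
import torch.nn as nn


class ObjectBranchEncoder(nn.Module):
    """Sketch of an object branch: objects across frames are graph nodes,
    updated by one graph convolution and pooled into a single vector."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.transform = nn.Linear(feat_dim, hidden_dim)  # shared node projection

    def forward(self, obj_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, feat_dim) features of N = frames x detected objects
        # adj:       (N, N) row-normalized spatial + temporal adjacency (assumed given)
        nodes = torch.relu(adj @ self.transform(obj_feats))  # one message-passing step
        return nodes.mean(dim=0)  # pooled object-branch representation


# Usage sketch: the pooled vector, together with scene-level 2D/3D CNN features,
# would condition two caption decoders whose word logits enter the distillation
# loss sketched earlier. Dimensions and the random adjacency are illustrative.
encoder = ObjectBranchEncoder()
pooled = encoder(torch.randn(40, 2048), torch.softmax(torch.randn(40, 40), dim=-1))
```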
Results
  • The authors evaluate the proposed model on two challenging benchmark datasets: Microsoft Research Video to Text (MSR-VTT) [46] and the Microsoft Video Description Corpus (MSVD) [3] (a metric-computation sketch follows this list).
  • MSR-VTT is a widely used large-scale benchmark dataset for video captioning.
  • It consists of 10,000 video clips, each human-annotated with 20 English sentences.
  • OA-BTG [49] constructs object trajectories by tracking the same objects through time.
  • While these works generally focus on the encoding side, Wang et al. [39] and Hou et al. [19] focus on language decoding and both propose to predict the part-of-speech (POS) structure first and use it to guide sentence generation.
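The percentages in Tables 1-3 are presumably the standard captioning scores whose papers appear in the reference list: BLEU [29], METEOR [1], ROUGE [25], and CIDEr [36]. The snippet below is a minimal sketch of how such scores are commonly computed with the open-source pycocoevalcap toolkit; it is not the authors' evaluation code, and the example captions are made up.

```python
# pip install pycocoevalcap  (Java is required for the METEOR scorer)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a video id to a list of captions: all references vs. one prediction.
gts = {"video0": ["a man is playing a guitar", "someone plays the guitar"]}
res = {"video0": ["a man plays a guitar"]}

for name, scorer in [("BLEU-1..4", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```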
Conclusion
  • The authors propose a novel spatio-temporal graph network for video captioning that explicitly exploits spatio-temporal object interactions, which are crucial for scene understanding and description.
  • The authors design a two-branch framework with a proposed object-aware knowledge distillation mechanism, which solves the problem of noisy feature learning present in previous spatio-temporal graph models.
  • The authors demonstrate the effectiveness of the approach on two benchmark video captioning datasets.
Tables
  • Table1: Comparison with other methods on MSR-VTT (%). “-” means number not available. The first section includes methods that optimize language decoding, while the second is for those that focus on visual encoding
  • Table2: Comparison with other methods on MSVD (%)
  • Table3: Ablation study on MSVD (%)
Related work
  • General Video Classification. Spatio-temporal reasoning is one of the main topics in video understanding. With the success of deep Convolutional Neural Networks (CNNs) on image recognition [24], many deep architectures have been proposed for the space-time domain. C3D [33] and I3D [2] build hierarchical spatio-temporal representations by performing 3D convolution. The two-stream network [10] incorporates additional motion information by fusing an extra optical flow branch. TSN [41], on the other hand, exploits the heavy redundancy between adjacent video frames by sampling frames sparsely. Arguing that previous methods fail to capture long-term dependencies, several recent works [9, 42, 44, 50] model a wider temporal range. Specifically, TRN [50] extends TSN by considering multi-level sampling frequencies. The non-local network [42] explicitly creates long-term spatio-temporal links among features. The SlowFast network [9] exploits multiple time scales by creating two pathways with different temporal resolutions. Alternatively, the long-term feature bank [44] directly stores long-term features and later correlates them with short-term features. However, all of these models reason directly over raw pixels and often fail to ground their predictions in visual evidence, instead picking up dataset bias. In contrast, we propose to model relationships over higher-level entities, which in our case are the objects within scenes.
  • Spatio-Temporal Graphs. While the idea of graphical scene representation has been explored extensively in the image domain [20, 23, 48], its extension to videos has only recently attracted attention. Among the earlier attempts, ST-GCN [47] models human body joint coordinates to perform action classification. Later works directly model the objects in a scene; the resulting representation is then used for various downstream tasks, such as action classification [17, 43, 45], action localization [11, 28], relation prediction [34], and gaze prediction [8]. All these works target simple classification or localization tasks where capturing object interactions may not be as important, so the effect of the spatio-temporal graph remains unclear. In this work, we target the much harder task of video captioning.
Contributions
  • Proposes a spatio-temporal graph model to explicitly capture spatio-temporal object interactions for video captioning
  • Proposes a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time
  • Proposes an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features
  • Demonstrates the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions
  • Introduces a two-branch network structure, where an object branch captures object interaction as privileged information, and injects it into a scene branch by performing knowledge distillation between their language logits
Reference
  • [1] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  • [2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [3] David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 190–200. Association for Computational Linguistics, 2011.
  • [4] Ming Chen, Yingming Li, Zhongfei Zhang, and Siyu Huang. TVT: Two-view transformer network for video captioning. In Asian Conference on Machine Learning, pages 847–862, 2018.
  • [5] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [6] Yangyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. Less is more: Picking informative frames for video captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 358–373, 2018.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
  • [8] Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. arXiv preprint arXiv:1909.02144, 2019.
  • [9] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
  • [10] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
  • [11] Pallabi Ghosh, Yi Yao, Larry S. Davis, and Ajay Divakaran. Stacked spatio-temporal graph convolutional networks for action segmentation. arXiv preprint arXiv:1811.10575, 2018.
  • [12] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  • [13] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2712–2719, 2013.
  • [14] Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827–2836, 2016.
  • [15] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [17] Roei Herzig, Elad Levi, Huijuan Xu, Eli Brosh, Amir Globerson, and Trevor Darrell. Classifying collisions with spatio-temporal action graph networks. arXiv preprint arXiv:1812.01233, 2018.
  • [18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [19] Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, and Yunde Jia. Joint syntax representation learning and visual cue translation for video captioning. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [20] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3678, 2015.
  • [21] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [22] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [23] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [25] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  • [26] David Lopez-Paz, Leon Bottou, Bernhard Scholkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
  • [27] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6790–6800, 2018.
  • [28] Effrosyni Mavroudi, Benjamin Bejar Haro, and Rene Vidal. Neural message passing on hybrid spatio-temporal visual and symbolic graphs for video understanding. arXiv preprint arXiv:1905.07385, 2019.
  • [29] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [30] Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. Memory-attended recurrent network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8347–8356, 2019.
  • [31] Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition, pages 184–195.
  • [32] Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433–440, 2013.
  • [33] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
  • [34] Yao-Hung Hubert Tsai, Santosh Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, and Ali Farhadi. Video relationship reasoning using gated spatio-temporal energy graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [36] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  • [37] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542, 2015.
  • [38] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
  • [39] Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with POS sequence guidance based on gated fusion network. arXiv preprint arXiv:1908.10072, 2019.
  • [40] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7622–7631, 2018.
  • [41] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36.
  • [42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [43] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
  • [44] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019.
  • [45] Jianchao Wu, Limin Wang, Li Wang, Jie Guo, and Gangshan Wu. Learning actor relation graphs for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9964–9974, 2019.
  • [46] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.
  • [47] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [48] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685, 2018.
  • [49] Junchao Zhang and Yuxin Peng. Object-aware aggregation with bidirectional temporal graph for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8327–8336, 2019.
  • [50] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
  • [51] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, and Marcus Rohrbach. Grounded video description. In CVPR, 2019.
  • [52] Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748, 2018.