BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

EMNLP 2020, pp. 1846-1859

Abstract

Video-grounded dialogues are very challenging due to (i) the complexity of videos, which contain both spatial and temporal variations, and (ii) the complexity of user utterances, which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus…

Introduction
  • A video-grounded dialogue agent aims to converse with humans based on signals from natural language and from other modalities such as sound and vision of the input video.
  • This setting extends video question answering (Tapaswi et al., 2016; Jang et al., 2017; Lei et al., 2018) in that the agent answers questions from humans over multiple turns rather than a single turn (see Figure 1)
  • This is a very complex task as the dialogue agent needs to possess strong language understanding to generate natural responses and sophisticated reasoning over video information, including the related objects, their positions and motions, etc.
Highlights
  • A video-grounded dialogue agent aims to converse with humans based on signals from natural language and from other modalities such as sound and vision of the input video
  • We proposed Bi-directional Spatio-Temporal Learning (BiST), a novel deep neural network approach for video-grounded dialogues and video QA, which exploits the complex visual nuances of videos through a bidirectional reasoning framework in both spatial and temporal dimensions
  • Our experimental results show that BiST can extract relevant, high-resolution visual cues from videos and generate quality dialogue responses/answers
Methods
  • The authors sample video clips to extract visual features with a window size of 16 frames, and a stride of 16 and 4 in AVSD and TGIF-QA respectively (a minimal sampling sketch is given after this list).
  • In TGIF-QA experiments, the authors extract visual features from pretrained ResNet-152 (He et al, 2016) for a fair comparison with existing work.
  • In AVSD experiments, the authors make use of the video summary as the video-dependent text input X_cap.
  • The authors adopt the Adam optimizer (Kingma and Ba, 2015) and the learning rate
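The clip sampling described in the Methods above (16-frame windows; stride 16 for AVSD and 4 for TGIF-QA) can be illustrated with a minimal sketch. This is not the authors' code: the array layout and the sample_clips helper are assumptions for illustration only.

```python
import numpy as np

def sample_clips(video_frames: np.ndarray, window: int = 16, stride: int = 16) -> np.ndarray:
    """Slice a video into fixed-length clips.

    video_frames: (T, H, W, 3) array of T RGB frames.
    window:       frames per clip (16 in the paper).
    stride:       step between clip starts (16 for AVSD, 4 for TGIF-QA).
    Returns an array of shape (num_clips, window, H, W, 3).
    """
    num_frames = video_frames.shape[0]
    clips = []
    for start in range(0, max(num_frames - window + 1, 1), stride):
        clip = video_frames[start:start + window]
        if clip.shape[0] < window:  # pad the tail of short videos with zero frames
            clip = np.pad(clip, ((0, window - clip.shape[0]), (0, 0), (0, 0), (0, 0)))
        clips.append(clip)
    # Each clip would then be fed to a pretrained video model (e.g. a 3D CNN for
    # AVSD, or per-frame ResNet-152 features for TGIF-QA) to obtain the
    # spatio-temporal features consumed by BiST.
    return np.stack(clips)

# Toy example: a 70-frame video sampled with the TGIF-QA stride of 4.
dummy_video = np.zeros((70, 224, 224, 3), dtype=np.float32)
print(sample_clips(dummy_video, window=16, stride=4).shape)  # (14, 16, 224, 224, 3)
```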
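The learning-rate detail is cut off in the summary above. Purely as a hedged illustration, the sketch below pairs Adam with the inverse-square-root warm-up schedule common in Transformer-style models (Vaswani et al., 2017); the schedule and the d_model/warmup_steps values are assumptions, not the paper's reported settings.

```python
import torch

d_model, warmup_steps = 512, 4000                  # assumed values, not from the paper
model = torch.nn.Linear(d_model, d_model)          # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step: int) -> float:
    """Inverse-square-root warm-up factor (multiplies the base lr of 1.0)."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for _ in range(5):                                  # toy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(8, d_model)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```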
Results
  • The authors report the objective scores, including BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015); a minimal scoring sketch follows this list.
  • These metrics, which formulate lexical overlaps between generated and ground-truth dialogue responses, are borrowed from language generation tasks such as machine translation and captioning.
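As a quick illustration of these overlap metrics (not the official benchmark evaluation code), the sketch below scores one generated response against a single reference using NLTK's sentence-level BLEU and a hand-rolled ROUGE-L; METEOR and CIDEr need their own tooling and are omitted here, and the example sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l(hyp: list, ref: list) -> float:
    """F-measure over the longest common subsequence of two token lists."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if hyp[i] == ref[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

hypothesis = "a man is sitting on the couch watching tv".split()
reference = "a man sits on the couch and watches tv".split()

bleu4 = sentence_bleu([reference], hypothesis, smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4 = {bleu4:.3f}, ROUGE-L = {rouge_l(hypothesis, reference):.3f}")
```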
Conclusion
  • The authors proposed BiST, a novel deep neural network approach for video-grounded dialogues and video QA, which exploits the complex visual nuances of videos through a bidirectional reasoning framework in both spatial and temporal dimensions.
  • The authors' experimental results show that BiST can extract relevant, high-resolution visual cues from videos and generate quality dialogue responses/answers
Tables
  • Table1: Summary of DSTC7 AVSD and TGIF-QA benchmark. The TGIF-QA contains 4 different tasks: (1) Count: open-ended QA which counts the number of repetitions of an action. (2) Action: multi-choice (MC) QA about a certain action occurring a fixed number of times. (3) Transition: MC QA about the temporal variation of video. (4) Frame: open-ended QA which can be answered from one video frame
  • Table2: Evaluation results on the test split of the AVSD benchmark. The results are presented in 4 settings by video feature components: (1) visual-only, (2) visual and text, (3) visual and audio, and (4) visual, audio, and text
  • Table3: Evaluation results on the test split of the TGIF-QA benchmark. Visual features are: R(ResNet), C(C3D), F(FlowCNN), RX(ResNext)
  • Table4: Ablation analysis on the AVSD benchmark with variants of BiST by spatio-temporal dynamics
  • Table5: Performance of model variants by N = N_att = N_dec and h_att on the AVSD benchmark
  • Qualitative observations: for ambiguous examples such as Example C (where the visual appearance is not clear enough to differentiate “apartment” from “business office”), our model can return the correct answer; this can potentially be explained by the signals extracted from spatial-level feature representations. There are still some errors that make the output sentences partially wrong, such as mismatched subjects (Example A), wrong entities (Example B), or wrong actions (Example C). For detailed qualitative analysis, please refer to Appendix B
Related work
  • Our work is related to two research topics: video-grounded dialogues and spatio-temporal learning.

    Video-grounded Dialogues. Following recent efforts that combine NLP and Computer Vision research (Antol et al., 2015; Xu et al., 2015; Goyal et al., 2017), video-grounded dialogues extend two major research fields: video action recognition and detection (Simonyan and Zisserman, 2014; Yang et al., 2016; Carreira and Zisserman, 2017) and dialogues/QA (Rajpurkar et al., 2016; Budzianowski et al., 2018; Gao et al., 2019a). Approaches to video-grounded dialogues (Sanabria et al., 2019; Hori et al., 2019; Le et al., 2019b) typically use pretrained video models, such as 2D CNN models on video frames (Donahue et al., 2015; Feichtenhofer et al., 2016) and 3D CNN models on video clips (Tran et al., 2015; Carreira and Zisserman, 2017), to extract visual features. However, these approaches mostly exploit superficial information from the temporal dimension and neglect spatial-level signals: spatial-level features are integrated simply through sum pooling with equal weights to obtain a global representation at the temporal level. They are thus not ideal for complex questions that investigate entity-level or spatial-level information (Jang et al., 2017; Alamri et al., 2019). The dialogue setting exacerbates this limitation, as it allows users to explore various aspects of the video contents, including both low-level (spatial) and high-level (temporal) information, over multiple dialogue turns. Our approach aims to address this challenge in video-grounded dialogues by retrieving fine-grained information from video through a bidirectional reasoning framework.

    Spatio-temporal Learning. Most efforts in spatio-temporal learning focus on action recognition or detection tasks. Yang et al. (2019) propose to progressively refine coarse-scale information through temporal extension and spatial displacement for action detection. Li et al. (2019a) use a shared network of 2D CNNs over three orthogonal views of video to obtain spatial and temporal signals for action recognition. Qiu et al. (2019) adopt a two-path network architecture that integrates global and local information of both temporal and spatial dimensions for video classification. Other research areas that investigate spatio-temporal learning include video captioning (Aafaq et al., 2019), video super-resolution (Li et al., 2019b), and video object segmentation (Xu et al., 2019). In general, spatio-temporal learning approaches aim to process higher-resolution information from complex videos that involve multiple objects in each video frame or motions over video segments (Yang et al., 2019). We are motivated by a similar observation in video-grounded dialogues and explore a vision-language bidirectional reasoning approach to obtain more fine-grained visual features.
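To make the contrast between equal-weight sum pooling and text-guided bidirectional reasoning concrete, here is a simplified PyTorch sketch of the idea. It is not the authors' implementation: BiST keeps full feature sequences and uses multi-head attention, whereas this toy version collapses one axis at a time, and all tensor shapes, the attend helper, and the final summation are assumptions.

```python
import torch
import torch.nn.functional as F

def attend(query: torch.Tensor, feats: torch.Tensor, dim: int) -> torch.Tensor:
    """Text-guided soft attention that collapses one axis of a feature tensor.

    query: (B, d) pooled text representation
    feats: (B, ..., d) video features; `dim` selects the axis to pool over
    """
    q = query.view(query.size(0), *([1] * (feats.dim() - 2)), -1)
    scores = (feats * q).sum(-1) / feats.size(-1) ** 0.5      # dot-product scores
    weights = F.softmax(scores, dim=dim).unsqueeze(-1)        # attention weights
    return (feats * weights).sum(dim=dim)                     # weighted pooling

B, T, S, d = 2, 8, 49, 256              # assumed batch / time / space / feature sizes
video = torch.randn(B, T, S, d)         # frame-region features from a pretrained CNN
query = torch.randn(B, d)               # encoded dialogue query

# Baseline criticised above: equal-weight pooling over spatial positions,
# which keeps only a coarse temporal-level representation.
temporal_only = video.mean(dim=2)                          # (B, T, d)

# Temporal-to-spatial direction: attend over time, then over the remaining space axis.
t2s = attend(query, attend(query, video, dim=1), dim=1)    # (B, d)

# Spatial-to-temporal direction: attend over space first, then over time.
s2t = attend(query, attend(query, video, dim=2), dim=1)    # (B, d)

# BiST-style bidirectional cue: combine both directions (a plain sum here).
bidirectional = t2s + s2t
print(bidirectional.shape)                                  # torch.Size([2, 256])
```

The two orderings prioritize different granularities of the video first, which is the intuition behind reasoning in both spatial-to-temporal and temporal-to-spatial directions.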
Reference
  • Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. 2019. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12487–12496.
  • Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Stefan Lee, Peter Anderson, Irfan Essa, Devi Parikh, Dhruv Batra, Anoop Cherian, Tim K. Marks, and Chiori Hori. 2019. Audio-visual scene-aware dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Huda Alamri, Chiori Hori, Tim K Marks, Dhruv Batra, and Devi Parikh. 2018. Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In DSTC7 at AAAI2019 Workshop, volume 2.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
  • Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308.
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634.
  • Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1999–2007.
  • Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1933–1941.
  • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 457–468, Austin, Texas. Association for Computational Linguistics.
  • Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019a. Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval, 13(2-3):127–298.
  • Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. 2018. Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6576–6585.
  • Lianli Gao, Pengpeng Zeng, Jingkuan Song, YuanFang Li, Wu Liu, Tao Mei, and Heng Tao Shen. 2019b. Structured two-stream attention network for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6391–6398.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, volume 1, page 3.
  • Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– 778.
  • Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE.
  • C. Hori, H. Alamri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das, I. Essa, D. Batra, and D. Parikh. 2019. Endto-end audio visual scene-aware dialog using multimodal attention-based video features. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2352–2356.
  • Chiori Hori, Anoop Cherian, Tim K Marks, and Takaaki Hori. 2019. Joint student-teacher learning for audio-visual scene-aware dialog. Proc. Interspeech 2019, pages 1886–1890.
  • Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. 2020. Location-aware graph convolutional networks for video question answering. In The AAAI Conference on Artificial Intelligence (AAAI), volume 1.
  • Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatiotemporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766.
  • Jianwen Jiang, Ziqiang Chen, Haojie Lin, Xibin Zhao, and Yue Gao. 2020. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In The AAAI Conference on Artificial Intelligence (AAAI).
  • Pin Jiang and Yahong Han. 2020. Reasoning with heterogeneous graph alignment for video question answering. In The AAAI Conference on Artificial Intelligence (AAAI).
  • Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • Hung Le, S Hoi, Doyen Sahoo, and N Chen. 2019a. End-to-end multimodal dialog systems with hierarchical multimodal attention on video features. In DSTC7 at AAAI2019 workshop.
  • Hung Le, Doyen Sahoo, Nancy Chen, and Steven Hoi. 2019b. Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5612–5623, Florence, Italy. Association for Computational Linguistics.
  • Chenyi Lei, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. 2020. Multiquestion learning for visual question answering. In The AAAI Conference on Artificial Intelligence (AAAI).
  • Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1369–1379, Brussels, Belgium. Association for Computational Linguistics.
  • Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. 2019a. Collaborative spatiotemporal feature learning for video action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Sheng Li, Fengxiang He, Bo Du, Lefei Zhang, Yonghao Xu, and Dacheng Tao. 2019b. Fast spatio-temporal residual network for video superresolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. 2019c. Beyond rnns: Positional self-attention with co-attention for video question answering. In The 33rd AAAI Conference on Artificial Intelligence, volume 8.
  • Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
  • Dat Tien Nguyen, Shikhar Sharma, Hannes Schulz, and Layla El Asri. 2018. From film to video: Multiturn question answering with multi-modal context. In AAAI 2019 Dialog System Technology Challenge (DSTC7) Workshop.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. 2019. Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12056–12065.
  • Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. 2019c. Learning to reason with relational video representation for question answering. arXiv preprint arXiv:1907.04553.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. 2020. Hierarchical conditional relation networks for video question answering. arXiv preprint arXiv:2002.10698.
  • Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in neural information processing systems, pages 2953–2961.
  • Ramon Sanabria, Shruti Palaskar, and Florian Metze. 2019. Cmu sinbad’s submission for the dstc7 avsd challenge. In DSTC7 at AAAI2019 workshop, volume 6.
  • Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681.
  • Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G Schwing. 2019. Factor graph attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2039– 2048.
  • Gunnar A Sigurdsson, Gul Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer.
  • Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
  • Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding Stories in Movies through Question-Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
  • Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500.
  • Kai Xu, Longyin Wen, Guorong Li, Liefeng Bo, and Qingming Huang. 2019. Spatiotemporal cnn for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1379–1388.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.
  • Xiaodong Yang, Pavlo Molchanov, and Jan Kautz. 2016. Multilayer and multimodal fusion of deep neural networks for video classification. In Proceedings of the 24th ACM international conference on multimedia, pages 978–987. ACM.
  • Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry S Davis, and Jan Kautz. 2019. Step: Spatio-temporal progressive learning for video action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 264–272.
  • Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4584–4593.
  • Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-end concept word detection for video captioning, retrieval, and question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3165– 3173.
A TGIF-QA Baselines
  • In TGIF-QA experiments, we compare our models with the following baselines: (1) VIS (Ren et al., 2015) and (2) MCB (Fukui et al., 2016) are two image-based VQA baselines which were adapted to TGIF-QA by (Jang et al., 2017).
  • (3) Yu et al. (Yu et al., 2017) uses a high-level concept word detector and the detected words are used for semantic reasoning.
  • (4) ST-VQA (Jang et al., 2017) integrates temporal and spatial features by first pretraining temporal part and then finetuning the spatial part.
  • (5) Co-Mem (Gao et al., 2018) includes a co-memory mechanism on two video streams based on motion and appearance features.
  • (6) PSAC (Li et al., 2019c) uses multi-head attention layers to exploit the dependencies between text and temporal variation of video.
  • (7) HME (Fan et al., 2019) is a memory network with read and write operations to update global context representations.
  • (8) STA (Gao et al., 2019b) divides video into N segments and uses temporal attention modules on each segment independently.
  • (9) CRN+MAC (Le et al., 2019c) is a clip-based reasoning framework by aggregating frame-level features into clips through temporal attention.
  • (10) MQL (Lei et al., 2020) exploits the semantic relations among questions and proposes a multi-label prediction task.
  • (11) QueST (Jiang et al., 2020) has two types of question embeddings: spatial and temporal embeddings based on attention guided by video features.
  • (12) HGA (Jiang and Han, 2020) is a graph alignment network consisting of inter- and intra-modality edges to model the interaction between video and question.
  • (13) GCN (Huang et al., 2020) is a similar approach with graph network but utilizes the video object-level features as node representations.
  • (14) HCRN (Le et al., 2020) extends (Le et al., 2019c) with a hierarchical relation network over temporallevel video features.
B Qualitative Analysis
  • We present additional example outputs in Figure 4. For each example, we include the last dialogue turn from the dialogue history. In general, BiST can generate responses that better match the ground truth than the Baseline (Hori et al., 2019) and MTN (Le et al., 2019b) (Examples A and B). Furthermore, we analyze both negative and positive outputs.