Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives
Conference on Empirical Methods in Natural Language Processing (2022)
Key words
Visual Question Answering, Video Summarization, Feature Matching, Image Captioning, Semantic Analysis