Multi-modal Transformer for Video Retrieval
European Conference on Computer Vision (ECCV), pp. 214-229, 2020.
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited...