Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
European Conference on Computer Vision (ECCV), pp. 333-351, 2020.
Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and videos. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent videos, but interactions between modalities are rarely explored. Inspired by...