ProTA: Probabilistic Token Aggregation for Text-Video Retrieval.

Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun

CoRR (2024)

Abstract
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9
Key words
Text-video Retrieval, Token Aggregation, Probabilistic Distribution