ProTA: Probabilistic Token Aggregation for Text-Video Retrieval.

Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun


Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9
Text-video Retrieval, Token Aggregation, Probabilistic Distribution
