ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

IEEE International Conference on Multimedia and Expo (2024)

China Telecom Corporation Ltd. Data&AI Technology Company

Abstract
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, a model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9).
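
To make the abstract's components concrete, the sketch below shows, in PyTorch, one plausible shape for token-based probabilistic alignment: each token feature is mapped to a Gaussian, reparameterized samples keep the representation stochastic (and hence diverse), and token-pair similarities are aggregated with a max-over-tokens operator. This is a minimal illustration under stated assumptions, not the authors' implementation; the names ProbabilisticTokenHead and probabilistic_similarity, the Gaussian parameterization, and the max/mean aggregation operator are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticTokenHead(nn.Module):
    # Maps deterministic token features to a per-token Gaussian and draws
    # reparameterized samples from it.
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, tokens, n_samples=4):
        # tokens: (batch, n_tokens, dim)
        mu = self.mu(tokens)
        std = torch.exp(0.5 * self.log_var(tokens))
        # Sampling keeps the token representation stochastic (diverse)
        # while remaining differentiable via the reparameterization trick.
        eps = torch.randn((n_samples,) + tuple(tokens.shape), device=tokens.device)
        return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (S, B, N, D)

def probabilistic_similarity(text_tokens, video_tokens, text_head, video_head):
    # Score for matched text-video pairs: cosine similarity between every
    # sampled text/video token pair, max over video tokens for each text
    # token, then mean over text tokens and Monte Carlo samples.
    t = F.normalize(text_head(text_tokens), dim=-1)    # (S, B, Nt, D)
    v = F.normalize(video_head(video_tokens), dim=-1)  # (S, B, Nv, D)
    sim = torch.einsum('sbnd,sbmd->sbnm', t, v)        # token-pair sims
    return sim.max(dim=-1).values.mean(dim=(0, -1))    # (B,)

A full model would score every text-video pair in a batch this way and feed the resulting similarity matrix to a contrastive objective, such as the one sketched at the end of this page.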
Key words
Text-video Retrieval, Token Aggregation, Probabilistic Distribution

[Key points]: This paper proposes the Probabilistic Token Aggregation (ProTA) model to handle asymmetric cross-modal interaction between video and text. It disentangles and re-aggregates token representations in both low-dimension and high-dimension spaces via dual partial-related aggregation, introduces token-based probabilistic alignment to generate token-level probabilistic representations while maintaining feature representation diversity, and proposes an adaptive contrastive loss to learn a compact cross-modal distribution space.

[Method]: Dual partial-related aggregation disentangles and re-aggregates token representations in low-dimension and high-dimension spaces, combined with token-based probabilistic alignment and an adaptive contrastive loss.

[Experiments]: On the MSR-VTT dataset, ProTA achieves a significant performance improvement, with a score of 50.9.
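
As a companion to the alignment sketch above, here is one hedged reading of the adaptive contrastive loss over a batch similarity matrix: a symmetric InfoNCE objective whose learnable temperature adapts the sharpness of the softmax over negatives during training. The class name AdaptiveContrastiveLoss and the learnable-temperature mechanism are assumptions made for illustration; the paper's exact notion of adaptivity may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContrastiveLoss(nn.Module):
    # Symmetric InfoNCE with a learnable temperature (an assumption about
    # what "adaptive" means here, not the paper's confirmed formulation).
    def __init__(self, init_temperature=0.07):
        super().__init__()
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, sim):
        # sim: (B, B) text-video similarities; matched pairs on the diagonal.
        logits = sim / self.log_tau.exp()
        labels = torch.arange(sim.size(0), device=sim.device)
        # Pull matched pairs together and push mismatched pairs apart in
        # both text-to-video and video-to-text directions, which encourages
        # a compact cross-modal distribution space.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))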