ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

IEEE International Conference on Multimedia and Expo (2024)

China Telecom Corporation Ltd. Data&AI Technology Company

Abstract
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, a model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9).
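
To make the abstract's components concrete, the sketch below shows, in PyTorch, one plausible shape for token-based probabilistic alignment: each token feature is mapped to a Gaussian, reparameterized samples keep the representation stochastic (and hence diverse), and token-pair similarities are aggregated with a max-over-tokens operator. This is a minimal illustration under stated assumptions, not the authors' implementation; the names ProbabilisticTokenHead and probabilistic_similarity, the Gaussian parameterization, and the max/mean aggregation operator are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticTokenHead(nn.Module):
    # Maps deterministic token features to a per-token Gaussian and draws
    # reparameterized samples from it.
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, tokens, n_samples=4):
        # tokens: (batch, n_tokens, dim)
        mu = self.mu(tokens)
        std = torch.exp(0.5 * self.log_var(tokens))
        # Sampling keeps the token representation stochastic (diverse)
        # while remaining differentiable via the reparameterization trick.
        eps = torch.randn((n_samples,) + tuple(tokens.shape), device=tokens.device)
        return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (S, B, N, D)

def probabilistic_similarity(text_tokens, video_tokens, text_head, video_head):
    # Score for matched text-video pairs: cosine similarity between every
    # sampled text/video token pair, max over video tokens for each text
    # token, then mean over text tokens and Monte Carlo samples.
    t = F.normalize(text_head(text_tokens), dim=-1)    # (S, B, Nt, D)
    v = F.normalize(video_head(video_tokens), dim=-1)  # (S, B, Nv, D)
    sim = torch.einsum('sbnd,sbmd->sbnm', t, v)        # token-pair sims
    return sim.max(dim=-1).values.mean(dim=(0, -1))    # (B,)

A full model would score every text-video pair in a batch this way and feed the resulting similarity matrix to a contrastive objective, such as the one sketched at the end of this page.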
Key words
Text-video Retrieval, Token Aggregation, Probabilistic Distribution

[Key points]: This paper proposes the Probabilistic Token Aggregation (ProTA) model to handle asymmetric cross-modal interaction between video and text. It disentangles and re-aggregates token representations in both low-dimension and high-dimension spaces via dual partial-related aggregation, introduces token-based probabilistic alignment to generate token-level probabilistic representations while maintaining feature representation diversity, and proposes an adaptive contrastive loss to learn a compact cross-modal distribution space.

[Method]: Dual partial-related aggregation disentangles and re-aggregates token representations in low-dimension and high-dimension spaces, combined with token-based probabilistic alignment and an adaptive contrastive loss.

[Experiments]: On the MSR-VTT dataset, ProTA achieves a significant performance improvement, with a score of 50.9.
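
As a companion to the alignment sketch above, here is one hedged reading of the adaptive contrastive loss over a batch similarity matrix: a symmetric InfoNCE objective whose learnable temperature adapts the sharpness of the softmax over negatives during training. The class name AdaptiveContrastiveLoss and the learnable-temperature mechanism are assumptions made for illustration; the paper's exact notion of adaptivity may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContrastiveLoss(nn.Module):
    # Symmetric InfoNCE with a learnable temperature (an assumption about
    # what "adaptive" means here, not the paper's confirmed formulation).
    def __init__(self, init_temperature=0.07):
        super().__init__()
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, sim):
        # sim: (B, B) text-video similarities; matched pairs on the diagonal.
        logits = sim / self.log_tau.exp()
        labels = torch.arange(sim.size(0), device=sim.device)
        # Pull matched pairs together and push mismatched pairs apart in
        # both text-to-video and video-to-text directions, which encourages
        # a compact cross-modal distribution space.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))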