Visual Redundancy Removal of Composite Images via Multimodal Learning

Wuyuan Xie, Shukang Wang, Rong Zhang, Miaohui Wang

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
Composite images are generated by combining two or more different photographs, and their content is typically heterogeneous. However, existing unimodal visual redundancy prediction methods struggle to accurately model the complex characteristics of this image type. In this paper, we investigate the visual redundancy modeling of composite images from an end-to-end multimodal perspective, covering four cross-media modalities (i.e., text, brightness, color, and segmentation). Specifically, we design a two-stage cross-modal alignment module based on a self-attention mechanism and contrastive learning, and develop a fusion module based on a cross-modal augmentation paradigm. Furthermore, we establish the first cross-media visual redundancy dataset for composite images, which contains 413 groups of cross-modal data and 13,629 realistic compression distortions generated with the latest Versatile Video Coding (VVC) standard. Experimental results on nine benchmark datasets demonstrate the effectiveness of our method, which outperforms seven representative methods.
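The alignment-and-fusion design described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes PyTorch, a shared 256-dimensional feature space, and hypothetical module names (CrossModalAlign, Fusion), and only outlines how self-attention across modality tokens, an InfoNCE-style contrastive loss, and a simple fusion step could fit together for the four modalities (text, brightness, color, segmentation).

```python
# Minimal sketch (assumed structure, not the paper's code), using PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlign(nn.Module):
    """Stage 1: self-attention over stacked modality tokens.
    Stage 2: contrastive (InfoNCE-style) alignment between modality embeddings."""
    def __init__(self, dim=256, heads=4, temperature=0.07):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temperature = temperature

    def forward(self, feats):  # feats: list of (B, dim) per-modality features
        tokens = torch.stack(feats, dim=1)              # (B, M, dim)
        aligned, _ = self.attn(tokens, tokens, tokens)  # attention across modalities
        return aligned

    def contrastive_loss(self, a, b):  # a, b: (B, dim) paired modality embeddings
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / self.temperature           # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

class Fusion(nn.Module):
    """Fuse the aligned modality tokens into a single redundancy feature."""
    def __init__(self, dim=256, num_modalities=4):
        super().__init__()
        self.proj = nn.Linear(dim * num_modalities, dim)

    def forward(self, aligned):                         # (B, M, dim)
        return self.proj(aligned.flatten(1))            # (B, dim)

# Usage with dummy features for the four modalities.
B, D = 8, 256
text, brightness, color, seg = (torch.randn(B, D) for _ in range(4))
align, fuse = CrossModalAlign(D), Fusion(D)
aligned = align([text, brightness, color, seg])
loss = align.contrastive_loss(aligned[:, 0], aligned[:, 1])  # e.g., text vs. brightness
redundancy_feat = fuse(aligned)
print(redundancy_feat.shape, loss.item())
```

In this sketch, the contrastive loss treats matching samples across two modalities as positives and other batch items as negatives; the paper's actual pairing scheme and augmentation paradigm are not specified here.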