WTVI: A Wavelet-Based Transformer Network for Video Inpainting

IEEE SIGNAL PROCESSING LETTERS(2024)

Abstract
Video inpainting aims to complete missing frames in a visually convincing way by balancing high-frequency detailed textures and low-frequency semantic structures. To achieve this balance, conventional approaches optimize output frames with both generative adversarial and reconstruction losses, each of which favors different frequency components. However, employing both loss types concurrently often results in a conflict between perceptual and distortion qualities, mainly due to their distinct frequency preferences. In response, this letter introduces the Wavelet-based Transformer network for Video Inpainting (WTVI). WTVI employs a 2D discrete wavelet transform (DWT) to decompose frames into several frequency bands while preserving spatial information. It then independently completes the missing regions in each band using a Transformer network. To mitigate inter-frequency conflicts, we apply reconstruction loss to the low-frequency bands and adversarial loss to the high-frequency bands. Additionally, we introduce High-frequency Cross-Attention (HCA) and Low-frequency Cross-Attention (LCA) modules to enhance frequency dependency learning beyond the spatial-temporal scope and to align features across bands. Our experiments confirm that WTVI surpasses previous methods, significantly improving both quantitative and qualitative performance.
Keywords
Discrete wavelet transforms,Transformers,Frequency-domain analysis,Wavelet domain,Semantics,Low-pass filters,Image restoration,Video inpainting,DWT,vision transformer
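
The band decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a single-level Haar wavelet and a single-channel frame with even height and width; the abstract does not state which wavelet or how many levels WTVI uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. Sub-band naming (LH versus HL) also varies between libraries. The sketch shows only the two properties the abstract relies on: the transform is invertible, and each band keeps the frame's spatial layout at half resolution, so a hole in the frame maps to the same relative location in every band.

```python
import numpy as np


def haar_dwt2(frame):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns four (H/2, W/2) bands: one low-frequency band (ll) and
    three high-frequency detail bands (lh, hl, hh).
    Assumption: Haar is used purely for illustration.
    """
    x = np.asarray(frame, dtype=np.float64)
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a - b + c - d) / 2.0  # differences between columns
    hl = (a + b - c - d) / 2.0  # differences between rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (H, W) frame from its four bands."""
    h, w = ll.shape
    frame = np.empty((2 * h, 2 * w), dtype=np.float64)
    frame[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    frame[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    frame[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    frame[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return frame


# A random 8x8 "frame" with a 4x4 hole (zeros) in the middle.
rng = np.random.default_rng(0)
frame = rng.random((8, 8)) + 1.0
frame[2:6, 2:6] = 0.0

ll, lh, hl, hh = haar_dwt2(frame)

# The hole covers rows/cols 2..5 of the frame, so it covers
# rows/cols 1..2 of every half-resolution band.
hole_in_bands = all(np.allclose(band[1:3, 1:3], 0.0) for band in (ll, lh, hl, hh))

# The four bands together reproduce the original frame.
reconstructed = haar_idwt2(ll, lh, hl, hh)
```

In a setup like the one the abstract describes, a reconstruction loss would be computed on the completed `ll` band and an adversarial loss on the completed `lh`, `hl`, and `hh` bands before applying the inverse transform. The completion networks and the discriminator are not shown here.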