Towards Image-to-Video Translation: A Structure-Aware Approach via Multi-stage Generative Adversarial Networks

International Journal of Computer Vision (2020)

Abstract
In this paper, we consider the problem of image-to-video translation, where one or a set of input images are translated into an output video that contains the motion of a single object. In particular, we focus on predicting motions conditioned on high-level structures, such as facial expression and human pose. Recent approaches are either condition-driven or temporal-based. Condition-driven approaches typically train transformation networks to generate future frames conditioned on the predicted structural sequence. Temporal-based approaches, on the other hand, have shown that short high-quality motions can be generated using 3D convolutional networks with temporal knowledge learned from massive training data. In this work, we combine the benefits of both approaches and propose a two-stage generative framework where videos are forecast from the structural sequence and then refined by temporal signals. To model motions more efficiently in the forecasting stage, we train networks with dense connections to learn residual motions between the current and future frames, which avoids learning motion-irrelevant details. To ensure temporal consistency in the refining stage, we adopt the ranking loss for adversarial training. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state of the art on both tasks demonstrate the effectiveness of our approach.
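As a hedged illustration (not the authors' implementation), the sketch below shows the two mechanisms the abstract names: a generator that predicts a residual motion added back onto the input frame, conditioned on a structure map such as a pose heatmap, and a margin-based ranking loss that asks the critic to score real clips above generated ones. All module names, layer layouts, and parameters here are hypothetical.

```python
# Minimal sketch of residual-motion prediction and a ranking loss for
# adversarial refinement, assuming PyTorch. Architectural details are
# placeholders, not the paper's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMotionGenerator(nn.Module):
    """Predicts a residual between the current and the future frame,
    conditioned on a structure map (e.g., a pose heatmap)."""
    def __init__(self, in_ch=3, cond_ch=1, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + cond_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, in_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, frame, structure):
        residual = self.net(torch.cat([frame, structure], dim=1))
        # Future frame = current frame + learned residual, so the network
        # only has to model motion, not motion-irrelevant appearance.
        return frame + residual

def ranking_loss(score_real, score_fake, margin=1.0):
    """One common margin-ranking formulation: the critic should score
    real clips above generated clips by at least `margin`."""
    target = torch.ones_like(score_real)
    return F.margin_ranking_loss(score_real, score_fake, target, margin=margin)

# Toy usage on random tensors.
g = ResidualMotionGenerator()
frame = torch.rand(2, 3, 64, 64)      # batch of current frames
pose = torch.rand(2, 1, 64, 64)       # batch of structure maps
pred = g(frame, pose)                 # predicted future frames
loss = ranking_loss(torch.randn(2), torch.randn(2))
print(pred.shape, loss.item())
```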
Keywords
Image-to-video translation, Video generation, Multi-stage GANs, Motion prediction, Residual learning