Learning Blind Video Temporal Consistency

COMPUTER VISION - ECCV 2018, PT 15, (2018): 179-195

Abstract

Applying image processing algorithms independently to each frame of a video often leads to undesired inconsistent results over time. Developing temporally consistent video-based extensions, however, requires domain knowledge for individual tasks and is unable to generalize to other applications. In this paper, we present an efficient appr…

Introduction
  • Recent advances in deep convolutional neural networks (CNNs) have led to the development of many powerful image processing techniques, including image filtering [30,37], enhancement [10,24,38], style transfer [17,23,29], colorization [19,41], and general image-to-image translation tasks [21,27,43]
  • Extending these CNN-based methods to video is non-trivial due to memory and computational constraints as well as the limited availability of training datasets.
  • Due to the dependency on flow computation at test time, these approaches tend to be slow
Highlights
  • Recent advances in deep convolutional neural networks (CNNs) have led to the development of many powerful image processing techniques, including image filtering [30,37], enhancement [10,24,38], style transfer [17,23,29], colorization [19,41], and general image-to-image translation tasks [21,27,43]
  • We show that our single model can handle multiple and unseen tasks, including but not limited to artistic style transfer, enhancement, colorization, image-to-image translation and intrinsic image decomposition
  • Applying image-based algorithms independently to each video frame typically leads to temporal flickering due to the instability of global optimization algorithms or highly non-linear deep networks
  • We address the temporal consistency problem on a wide range of applications, including automatic white balancing [14], harmonization [4], dehazing [13], image enhancement [10], style transfer [17,23,29], colorization [19,41], image-to-image translation [21,43], and intrinsic image decomposition [3]
  • We propose a deep recurrent neural network to reduce the temporal flickering problem in per-frame processed videos (a minimal sketch follows this list)
  • We demonstrate that the proposed algorithm performs favorably against existing blind temporal consistency methods on a diverse set of applications and various types of videos
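As a concrete illustration of the recurrent design summarized above, below is a minimal sketch (PyTorch) of a per-frame refinement network that carries a hidden state across time: it consumes the currently processed frame and the previous stabilized output and predicts a residual on top of the processed frame. The layer sizes, the plain convolutional state update, and the residual formulation are illustrative assumptions, not the authors' released architecture (the paper uses a ConvLSTM and trains with perceptual and temporal losses).

```python
# Minimal sketch (PyTorch) of a recurrent per-frame refinement network in the
# spirit of the paper: it takes the per-frame processed result P_t and the
# previous stabilized output O_{t-1}, updates a convolutional hidden state, and
# predicts a residual that is added to P_t. Layer sizes, the residual design,
# and the simple tanh state update are illustrative assumptions; the authors'
# network uses a ConvLSTM.
import torch
import torch.nn as nn

class RecurrentStabilizer(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # Encoder over the concatenated processed frame and previous output (6 channels).
        self.encode = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Simple convolutional recurrence over the hidden state (assumed; not a ConvLSTM).
        self.update = nn.Conv2d(2 * ch, ch, 3, padding=1)
        # Decoder predicts an RGB residual on top of the processed frame.
        self.decode = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, processed, prev_output, state):
        feat = self.encode(torch.cat([processed, prev_output], dim=1))
        state = torch.tanh(self.update(torch.cat([feat, state], dim=1)))
        return processed + self.decode(state), state

# Usage: iterate over frames, carrying the previous output and the hidden state.
if __name__ == "__main__":
    net = RecurrentStabilizer()
    frames = [torch.rand(1, 3, 240, 432) for _ in range(4)]   # per-frame processed video
    output, state = frames[0], torch.zeros(1, 32, 240, 432)   # first frame is kept as-is
    for p in frames[1:]:
        output, state = net(p, output, state)
```

Because the network only needs the per-frame processed video and its own previous output at test time, no optical flow has to be computed during inference, which is what keeps the approach fast and agnostic to the underlying image-based algorithm.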
Results
  • The authors first describe the employed datasets for training and testing, followed by the applications of the proposed method and the metrics for evaluating the temporal stability and perceptual similarity.
  • The videos in the DAVIS dataset are usually short, with 4,209 training frames in total.
  • The authors scale the height of video frames to 480 pixels while keeping the aspect ratio (see the preprocessing sketch after this list)
  • The authors use both the DAVIS and Videvo training sets, which together contain a total of 25,735 frames, to train the network
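As a small aside on the preprocessing mentioned above, the sketch below rescales a frame so that its height becomes 480 pixels while preserving the aspect ratio. The choice of OpenCV and bilinear interpolation is an assumption for illustration only.

```python
# Minimal sketch of the preprocessing described above: scale each frame so its
# height becomes 480 pixels while preserving the aspect ratio. OpenCV and
# bilinear interpolation are assumptions made for illustration only.
import cv2

def resize_to_height(frame, target_h=480):
    h, w = frame.shape[:2]
    scale = target_h / h
    # cv2.resize takes the destination size as (width, height).
    return cv2.resize(frame, (round(w * scale), target_h), interpolation=cv2.INTER_LINEAR)
```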
Conclusion
  • The authors propose a deep recurrent neural network to reduce the temporal flickering problem in per-frame processed videos.
  • The authors' approach is agnostic to the underlying image-based algorithm applied to the video and generalizes to a wide range of unseen applications.
  • The authors demonstrate that the proposed algorithm performs favorably against existing blind temporal consistency methods on a diverse set of applications and various types of videos
Tables
  • Table 1: Comparison of blind temporal consistency methods. Both the methods of Bonneel et al. [6] and Yao et al. [39] require dense correspondences from optical flow or PatchMatch [2], while the proposed method does not explicitly rely on these correspondences at test time. The algorithm of Yao et al. [39] involves a key-frame selection from the entire video and thus cannot generate output in an online manner
  • Table 2: Quantitative evaluation on temporal warping error (a sketch of this metric follows the table descriptions). The “Trained” column indicates the applications used for training our model. Our method achieves a similarly reduced temporal warping error as Bonneel et al. [6], which is significantly lower than that of the original processed video (Vp)
  • Table 3: Quantitative evaluation on perceptual distance. Our method has a lower perceptual distance than Bonneel et al. [6]
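For readers who want to reproduce the stability measurement behind Table 2, here is a minimal sketch of a temporal warping error: the previous output frame is warped to the current frame with optical flow and compared on non-occluded pixels. The flow convention (per-pixel offsets), the precomputed occlusion mask, and the normalization are assumptions of this sketch; the paper estimates flow with FlowNet2.

```python
# Minimal sketch of a temporal warping error in the spirit of Table 2: warp the
# previous output frame to the current one with (precomputed) optical flow and
# measure the squared difference over non-occluded pixels. The flow convention
# (per-pixel (dx, dy) offsets) and the occlusion mask are assumptions of this sketch.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (B,3,H,W) with flow (B,2,H,W) holding per-pixel (dx, dy) offsets."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1] for grid_sample
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)   # (B,H,W,2), x first
    return F.grid_sample(img, grid, align_corners=True)

def warping_error(curr, prev, flow, mask):
    """Mean squared difference between curr and the warped prev over non-occluded
    pixels; mask is a (B,1,H,W) non-occlusion map with values in {0, 1}."""
    diff = (curr - warp(prev, flow)) ** 2 * mask
    return diff.sum() / (curr.shape[1] * mask.sum()).clamp(min=1)
```

The perceptual distance in Table 3 is measured with the learned perceptual metric of Zhang et al. [42].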
Related work
  • We address the temporal consistency problem on a wide range of applications, including automatic white balancing [14], harmonization [4], dehazing [13], image enhancement [10], style transfer [17,23,29], colorization [19,41], image-to-image translation [21,43], and intrinsic image decomposition [3]. A complete review of these applications is beyond the scope of this paper. In the following, we discuss task-specific and task-independent approaches that enforce temporal consistency on videos.

    Task-specific approaches. A common solution to embed the temporal consistency constraint is to use optical flow to propagate information between frames, e.g., colorization [28] and intrinsic decomposition [40]. However, estimating optical flow is computationally expensive and thus impractical for high-resolution and long sequences. Temporal filtering is an efficient approach to extend image-based algorithms to videos, e.g., tone-mapping [1], color transfer [5], and visual saliency [25], to generate temporally consistent results. Nevertheless, these approaches assume a specific filter formulation and cannot be generalized to other applications.

    Table 1 summarizes the comparison of blind approaches: Bonneel et al. [6] enforce a gradient-domain content constraint, Yao et al. [39] a local affine one, and the proposed method a perceptual loss; the table also contrasts short-term and long-term temporal constraints, the need for dense correspondences at test time, and online processing.
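As a toy illustration of the flow-based propagation idea described above (not any particular cited method), the sketch below blends the current per-frame result with the flow-warped previous output wherever the flow is reliable. The warped frame, the confidence (non-occlusion) mask, and the blending weight alpha are assumed to be given.

```python
# Toy sketch of flow-based propagation as described above: the previous output,
# already warped into the current frame with optical flow, is blended with the
# per-frame result wherever the flow is reliable. The warped frame, the
# confidence (non-occlusion) mask, and the blending weight alpha are assumed inputs.
import numpy as np

def propagate(per_frame_result, warped_prev_output, confidence, alpha=0.85):
    """Blend the current per-frame result with the flow-warped previous output.

    confidence: per-pixel weight in [0, 1] (e.g., 1 - occlusion), shape (H, W, 1).
    alpha: how strongly to trust the propagated, temporally consistent estimate.
    """
    w = alpha * np.clip(confidence, 0.0, 1.0)
    return w * warped_prev_output + (1 - w) * per_frame_result
```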
Funding
  • This work is supported in part by NSF CAREER Grant #1149783, NSF Grant #1755785, and gifts from Adobe and NVIDIA.
References
  • Aydin, T.O., Stefanoski, N., Croci, S., Gross, M., Smolic, A.: Temporally coherent local tone mapping of HDR video. ACM TOG (2014)
  • Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM TOG (2009)
  • Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. ACM TOG (2014)
  • Bonneel, N., Rabin, J., Peyré, G., Pfister, H.: Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision (2015)
  • Bonneel, N., Sunkavalli, K., Paris, S., Pfister, H.: Example-based video color grading. ACM TOG (2013)
  • Bonneel, N., Tompkin, J., Sunkavalli, K., Sun, D., Paris, S., Pfister, H.: Blind video temporal consistency. ACM TOG (2015)
  • Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: ICCV (2017)
  • Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: ICCV (2017)
  • Dong, X., Bonev, B., Zhu, Y., Yuille, A.L.: Region-based temporally consistent video post-processing. In: CVPR (2015)
  • Gharbi, M., Chen, J., Barron, J.T., Hasinoff, S.W., Durand, F.: Deep bilateral learning for real-time image enhancement. ACM TOG (2017)
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
  • Gupta, A., Johnson, J., Alahi, A., Fei-Fei, L.: Characterizing and improving stability in neural style transfer. In: ICCV (2017)
  • He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. TPAMI (2011)
  • Hsu, E., Mertens, T., Paris, S., Avidan, S., Durand, F.: Light mixture estimation for spatially varying white balance. ACM TOG (2008)
  • Huang, H., Wang, H., Luo, W., Ma, L., Jiang, W., Zhu, X., Li, Z., Liu, W.: Real-time neural style transfer for videos. In: CVPR (2017)
  • Huang, J.B., Kang, S.B., Ahuja, N., Kopf, J.: Temporally coherent completion of dynamic video. ACM TOG (2016)
  • Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV (2017)
  • Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv (2016)
  • Iizuka, S., Simo-Serra, E., Ishikawa, H.: Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM TOG (2016)
  • Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR (2017)
  • Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
  • Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS (2015)
  • Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)
  • Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep Laplacian pyramid networks for fast and accurate super-resolution. In: CVPR (2017)
  • Lang, M., Wang, O., Aydin, T.O., Smolic, A., Gross, M.H.: Practical temporal consistency for image-based graphics applications. ACM TOG (2012)
  • Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR (2017)
  • Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M.K.S., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: ECCV (2018)
  • Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. ACM TOG (2004)
  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. In: NIPS (2017)
  • Li, Y., Huang, J.B., Ahuja, N., Yang, M.H.: Deep joint image filtering. In: ECCV (2016)
  • Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
  • Ruder, M., Dosovitskiy, A., Brox, T.: Artistic style transfer for videos. In: German Conference on Pattern Recognition (2016)
  • Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  • Videvo: https://www.videvo.net/
  • Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: NIPS (2015)
  • Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. In: ICML (2015)
  • Yan, Z., Zhang, H., Wang, B., Paris, S., Yu, Y.: Automatic photo adjustment using deep neural networks. ACM TOG (2016)
  • Yao, C.H., Chang, C.Y., Chien, S.Y.: Occlusion-aware video temporal consistency. In: ACM MM (2017)
  • Ye, G., Garces, E., Liu, Y., Dai, Q., Gutierrez, D.: Intrinsic video and applications. ACM TOG (2014)
  • Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV (2016)
  • Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
  • Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)