
PCLs: Geometry-aware Neural Reconstruction of 3D Pose with Perspective Crop Layers

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021


Abstract

Local processing is an essential feature of CNNs and other neural network architectures - it is one of the reasons why they work so well on images where relevant information is, to a large extent, local. However, perspective effects stemming from the projection in a conventional camera vary for different global positions in the image. […]

Introduction
  • Convolutional neural networks (CNNs) have proven highly effective for image-based prediction tasks because of their translation invariance and the locality of the computation they perform.
  • Applying the same convolutional filter at the top-left image corner and at the bottom-right one will yield different features, even though the pose is the same.
  • This is typically tackled by increasing the width and depth of the network, so that different filters and layers can model the same 3D pose perspective-distorted in different ways.
  • Two-stage approaches that lift 3D pose from 2D pose estimates using multilayer perceptrons (MLPs) [2, 28, 8, 23, 22, 34, 31, 26] rely on translational invariance by centering the 2D pose on a root joint, thereby also discarding important cues about perspective distortion, as the sketch below illustrates.
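The following minimal NumPy sketch (the root_center helper and the 3-joint toy pose are illustrative, not from the paper) shows why root centering discards location information: two copies of the same pose at different image positions become indistinguishable after centering, even though the camera sees them under different perspective distortions.

```python
import numpy as np

def root_center(pose_2d: np.ndarray, root_idx: int = 0) -> np.ndarray:
    """Express 2D keypoints relative to a root joint (e.g. the pelvis)."""
    return pose_2d - pose_2d[root_idx]

# The same 2D pose pasted at two different image locations...
pose = np.array([[0.0, 0.0], [10.0, 40.0], [-10.0, 40.0]])  # toy 3-joint pose
top_left = pose + np.array([50.0, 50.0])
bottom_right = pose + np.array([900.0, 500.0])

# ...is identical after root centering, so a lifting network can no longer
# tell where in the image the person stood, and hence how perspective
# distorted the projection.
assert np.allclose(root_center(top_left), root_center(bottom_right))
```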
Highlights
  • Convolutional neural networks (CNNs) have proven highly effective for image-based prediction tasks because of their translation invariance and the locality of the computation they perform
  • Two-stage approaches that lift 3D pose from 2D pose estimates using multilayer perceptrons (MLPs) [2, 28, 8, 23, 22, 34, 31, 26] rely on translational invariance by centering the 2D pose on a root joint, thereby also discarding important cues about perspective distortion
  • We demonstrate the benefits of our Perspective Crop Layers (PCLs) for 3D pose estimation of both rigid objects and articulated people
  • We evaluate the improvements brought about by PCL on the task of 3D human pose estimation from either images or 2D keypoints, and show that they hold for neural networks of diverse complexity
  • We report the percentage of correct keypoints (PCK), encoding the proportion of joints whose distance to the ground truth is less than a threshold, using thresholds of 50 and 100 millimeters (see the metrics sketch after this list)
  • We have presented a drop-in replacement for rectangular cropping and root centering that removes location-dependent perspective effects
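As a reference for the numbers quoted below, here is a minimal NumPy sketch of the two evaluation metrics; the 17-joint skeleton and the noise level are arbitrary assumptions made for the demo.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean joint distance (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck(pred: np.ndarray, gt: np.ndarray, thresh_mm: float) -> float:
    """Proportion of joints whose distance to ground truth is below the threshold."""
    return float((np.linalg.norm(pred - gt, axis=-1) < thresh_mm).mean())

rng = np.random.default_rng(0)
gt = rng.uniform(0, 1000, size=(17, 3))       # dummy 17-joint skeleton, in mm
pred = gt + rng.normal(0, 30, size=(17, 3))   # ~30 mm of noise per axis
print(mpjpe(pred, gt), pck(pred, gt, 50.0), pck(pred, gt, 100.0))
```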
Methods
  • The authors evaluate the improvements brought about by PCL on the task of 3D human pose estimation from either images or 2D keypoints, and show that they hold for neural networks of diverse complexity.
  • The benefits of PCL for the 2D-to-3D lifting task on the Human3.6M dataset [12] and the MPI-INF-3DHP dataset [25] are shown qualitatively in Fig. 6 and in additional experiments in the supplemental video.
  • The authors integrate PCL into the three neural network architectures for 3D pose estimation discussed below, and compare the resulting networks with the original ones.
  • To ensure a fair comparison, the authors scale the 2D input of the baseline by the crop scales s that are used in PCL (a sketch of how PCL wraps an existing lifting network follows this list).
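The paper does not publish this exact interface; the following PyTorch sketch only illustrates the drop-in idea, with pcl_forward and pcl_inverse as hypothetical stand-ins for the crop and uncrop operations, and a 17-joint skeleton assumed.

```python
import torch
import torch.nn as nn

class LiftingWithPCL(nn.Module):
    """Wrap a 2D-to-3D lifting network with PCL-style normalization.

    `pcl_forward` and `pcl_inverse` are hypothetical placeholders: the first
    maps 2D keypoints and camera intrinsics to a perspective-normalized
    ("virtual camera") frame and returns the rotation it used; the second
    rotates the predicted 3D pose back into the original camera frame.
    """

    def __init__(self, lifter: nn.Module, pcl_forward, pcl_inverse, n_joints: int = 17):
        super().__init__()
        self.lifter = lifter
        self.pcl_forward = pcl_forward
        self.pcl_inverse = pcl_inverse
        self.n_joints = n_joints

    def forward(self, pose_2d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        # Remove location-dependent perspective effects from the input.
        canon_2d, R = self.pcl_forward(pose_2d, K)
        # Lift in the canonical frame, where the task no longer depends on
        # where in the image the subject appeared.
        canon_3d = self.lifter(canon_2d.flatten(1)).view(-1, self.n_joints, 3)
        # Rotate the prediction back to the original camera frame.
        return self.pcl_inverse(canon_3d, R)
```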
Results
  • PCLs yield a consistent performance boost of 2–10% on average, and of up to 25% at the image boundary, where perspective effects are strongest.
  • For H3.6M, MLP+PCL achieves an MPJPE of 67.0 mm vs. 69.8 mm for the MLP+RC baseline [21] when using 2D detections from [41, 46] as input, a 4% improvement.
  • As shown in the last four rows of Table 1, when using images as input to a ResNet regressing 3D pose on H3.6M, the baseline achieves an MPJPE of 96.5 mm, while the model with PCL yields 94.1 mm, a 2.5% reduction (the short computation below verifies both percentages).
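The relative improvements quoted above follow directly from the reported MPJPE values:

```python
# Sanity-check the quoted relative improvements from the raw numbers.
for name, baseline, with_pcl in [("MLP on H3.6M", 69.8, 67.0),
                                 ("ResNet on H3.6M", 96.5, 94.1)]:
    print(f"{name}: {100 * (baseline - with_pcl) / baseline:.1f}% lower MPJPE")
# MLP on H3.6M: 4.0% lower MPJPE
# ResNet on H3.6M: 2.5% lower MPJPE
```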
Conclusion
  • The authors have presented a drop-in replacement for rectangular cropping and root centering that removes location-dependent perspective effects.
  • It is fully differentiable, lends itself to end-to-end training, is efficient to compute, and does not add network parameters; the empirical evaluation demonstrates significant improvements for 3D pose estimation.
  • The strong influence of perspective effects on reconstruction accuracy is widely overlooked in the 3D pose reconstruction literature, and the improvements reported here hold irrespective of the network architecture.
  • The authors therefore see PCL as an important contribution toward pushing state-of-the-art 3D reconstruction methods further.
Objectives
  • The authors' goal is to design a crop operation such that the optical center of the virtual camera is always at the center of the patch, making perspective distortion independent of image location.
  • The authors aim to find camera parameters such that mapping a pixel from the original image to the cropped patch amounts to multiplying its image coordinate in homogeneous coordinates, (u, v, 1), by a homography matrix (a hedged sketch of this construction follows).
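The paper derives the exact matrix; as an illustration of the underlying geometry only, the following NumPy sketch builds such a homography from a virtual camera rotated to look at the crop center. The basis construction and the virtual intrinsics K_virt are assumptions of this sketch, not the authors' formulation.

```python
import numpy as np

def perspective_crop_homography(K: np.ndarray, crop_center_px: np.ndarray,
                                K_virt: np.ndarray) -> np.ndarray:
    """Homography from original-image pixels to a virtual camera whose optical
    axis passes through the crop center (illustrative, not the paper's exact
    derivation). Assumes the crop center's viewing ray is not parallel to the
    chosen up vector, so the basis construction does not degenerate.
    """
    # Viewing ray through the crop center, in normalized camera coordinates.
    ray = np.linalg.inv(K) @ np.array([crop_center_px[0], crop_center_px[1], 1.0])
    z_new = ray / np.linalg.norm(ray)
    # Orthonormal basis with z_new as the virtual camera's optical axis.
    x_new = np.cross(np.array([0.0, 1.0, 0.0]), z_new)
    x_new /= np.linalg.norm(x_new)
    y_new = np.cross(z_new, x_new)
    R = np.stack([x_new, y_new, z_new])   # rows: virtual-camera axes
    # Pixel (u, v, 1) in the source maps to K_virt @ R @ K^-1 @ (u, v, 1).
    return K_virt @ R @ np.linalg.inv(K)

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
K_virt = np.array([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
H = perspective_crop_homography(K, np.array([200.0, 600.0]), K_virt)
p = H @ np.array([200.0, 600.0, 1.0])
print(p[:2] / p[2])   # the crop center lands at the patch center, (64, 64)
```

Applying such a homography with differentiable bilinear sampling, e.g. a spatial transformer [13], keeps the crop end-to-end trainable, which is consistent with the differentiability property claimed in the conclusion.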
Tables
  • Table 1: Shown are the reported MPJPE in millimeters as well as the PCK for 2D-to-3D keypoint lifting tests performed on H3.6M. The reported mean and standard deviation are computed over three runs with varying random seeds. For MPJPE, lower values are better; for PCK, higher values are better. Our method significantly outperforms the baselines that do not use PCL; the best-performing models in each category are shown in bold.
  • Table 2: Temporal CNN tests, computed as the MPJPE over two runs with varying seeds on H3.6M. While the baseline performs best using the original camera, it is unable to generalize to new camera settings; the PCL-equipped version strikes the best compromise.
  • Table 3: Shown are the reported MPJPE in millimeters for all tests conducted on the Cube dataset and its variations. For this metric, lower values are better. Our method produces more accurate results while at the same time generalizing better to unseen instances.
Related work
  • In this section, we discuss existing ways of handling image distortions and review the existing attention window mechanisms upon which PCLs are built.

    Handling perspective effects. Many works sidestep perspective effects by training and testing on synthetic renderings [1, 7, 49, 48] or real images [10, 49] where the object of interest is centered manually. However, these methods are not applicable to natural images where the object can be at an arbitrary location. If the object location is known in advance, perspective distortion can be undone in a preprocessing stage. For instance, [24] propose to rotate locally inferred 3D poses back to the camera frame. This strategy was later adopted by [14], but neither of these works undistorts the input images or the input 2D pose. [38, 37] apply an image correction, but only approximate the homography with an affine transformation (the sketch below quantifies the error of such an approximation). In other words, the above-mentioned approaches neither model the perspective correction in a geometrically accurate way nor formulate it as a differentiable layer. However, differentiability is an important prerequisite for end-to-end training on natural images, particularly for unsupervised approaches that deal with unknown object locations.
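To make the affine-approximation point concrete, the following NumPy sketch (with a made-up homography; the specific coefficients are arbitrary) fits the best least-squares affine map to a homography warp and reports the residual. An affine transform forces the projective row to (0, 0, 1), so any homography with a non-trivial projective row incurs exactly this kind of error.

```python
import numpy as np

def apply_h(H: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Apply a homography to a 2D point."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Hypothetical homography with a non-trivial projective (bottom) row.
H = np.array([[1.0, 0.1, 5.0],
              [0.0, 1.2, 3.0],
              [1e-3, 2e-3, 1.0]])

# Best least-squares affine fit over a grid of points, and its residual.
grid = np.array([[u, v] for u in np.linspace(0, 200, 11)
                        for v in np.linspace(0, 200, 11)])
warped = np.array([apply_h(H, p) for p in grid])
A = np.hstack([grid, np.ones((len(grid), 1))])
affine, *_ = np.linalg.lstsq(A, warped, rcond=None)
print(f"max affine fit error: {np.abs(A @ affine - warped).max():.2f} px")
```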
References
  • Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
  • Ching-Hang Chen and Deva Ramanan. 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7035–7043, 2017.
  • Taco Cohen, Mario Geiger, Jonas Kohler, and Max Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
  • Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–533, 2018.
  • Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
  • Yu Du, Yongkang Wong, Yonghao Liu, Feilin Han, Yilin Gui, Zhen Wang, Mohan Kankanhalli, and Weidong Geng. Marker-less 3D human motion capture with monocular image sequence and height-maps. In European Conference on Computer Vision (ECCV), pages 20–36. Springer, 2016.
  • H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Haoshu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3D pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6821–6828. AAAI Press, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630–645. Springer, 2016.
  • G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks (ICANN), pages 44–51, 2011.
  • Yannick Hold-Geoffroy, Kalyan Sunkavalli, Jonathan Eisenmann, Matthew Fisher, Emiliano Gambaretto, Sunil Hadap, and Jean-Francois Lalonde. A perceptual measure for deep single image camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2354–2363, 2018.
  • C. Ionescu, I. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 2017–2025, 2015.
  • A. Kanazawa, S. Tulsiani, A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In European Conference on Computer Vision (ECCV), 2018.
  • Renata Khasanova and Pascal Frossard. Graph-based classification of omnidirectional images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 869–878, 2017.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Shichao Li, Lei Ke, Kevin Pratama, Yu-Wing Tai, Chi-Keung Tang, and Kwang-Ting Cheng. Cascaded deep monocular 3D human pose estimation with evolutionary training data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Chen-Hsuan Lin and Simon Lucey. Inverse compositional spatial transformer networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2568–2576, 2017.
  • Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems (NeurIPS), pages 9605–9616, 2018.
  • Chenxu Luo, Xiao Chu, and Alan L. Yuille. OriNet: A fully convolutional network for 3D human pose estimation. In British Machine Vision Conference (BMVC), page 92, 2018.
  • J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In International Conference on Computer Vision (ICCV), 2017.
  • Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In International Conference on Computer Vision (ICCV), 2017.
  • Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In International Conference on 3D Vision (3DV). IEEE, 2017.
  • D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In International Conference on 3D Vision (3DV), 2017.
  • Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In International Conference on 3D Vision (3DV). IEEE, 2017.
  • Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. XNect: Real-time multi-person 3D motion capture with a single RGB camera. ACM Transactions on Graphics, 39, 2020.
  • Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36, 2017.
  • Francesc Moreno-Noguer. 3D human pose estimation from a single image via distance matrix regression. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. 3D human pose estimation with 2D marginal heatmaps. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1477–1485. IEEE, 2019.
  • Sungheon Park, Jihye Hwang, and Nojun Kwak. 3D human pose estimation using convolutional neural networks with 2D pose information. In European Conference on Computer Vision (ECCV), pages 156–169, 2016.
  • Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Ordinal depth supervision for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7307–7316, 2018.
  • Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1263–1272, 2017.
  • Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7753–7762, 2019.
  • Mir Rayat Imtiaz Hossain and James J. Little. Exploiting temporal information for 3D human pose estimation. In European Conference on Computer Vision (ECCV), 2018.
  • Adria Recasens, Petr Kellnhofer, Simon Stent, Wojciech Matusik, and Antonio Torralba. Learning to zoom: A saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018.
  • Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
  • H. Rhodin, M. Salzmann, and P. Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In European Conference on Computer Vision (ECCV), 2018.
  • H. Rhodin, J. Spoerri, I. Katircioglu, V. Constantin, F. Meyer, E. Moeller, M. Salzmann, and P. Fua. Learning monocular 3D human pose estimation from multi-view images. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net: Localization-classification-regression for human pose. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1216–1224. IEEE, 2017.
  • Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Advances in Neural Information Processing Systems (NeurIPS), pages 529–539, 2017.
  • Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  • Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2602–2611, 2017.
  • Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Structured prediction of 3D human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
  • Denis Tome, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Scott Workman, Connor Greenwell, Menghua Zhai, Ryan Baltenberger, and Nathan Jacobs. DeepFocal: A method for direct focal length estimation. In IEEE International Conference on Image Processing (ICIP), pages 1369–1373. IEEE, 2015.
  • Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.
  • Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. Deep kinematics analysis for monocular 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems (NeurIPS), pages 1696–1704, 2016.
  • J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems (NeurIPS), pages 1099–1107, 2015.
  • Yajie Zhao, Zeng Huang, Tianye Li, Weikai Chen, Chloe LeGendre, Xinglei Ren, Ari Shapiro, and Hao Li. Learning perspective undistortion of portraits. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 7849–7859, 2019.
  • Kun Zhou, Xiaoguang Han, Nianjuan Jiang, Kui Jia, and Jiangbo Lu. HEMlets Pose: Learning part-centric heatmap triplets for accurate 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2344–2353, 2019.