Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose

CVPR, 2017

Cited by: 464

Abstract

This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. […]

Introduction
  • Estimating the full-body 3D pose of a human from a single monocular image is an open challenge, which has garnered significant attention since the early days of computer vision [18].
  • Given its ill-posed nature, researchers have generally approached 3D human pose estimation in simplified settings, such as assuming background subtraction is feasible [1], relying on groundtruth 2D joint locations to estimate 3D pose [26, 43], employing additional camera views [7, 15], and capitalizing on temporal consistency to improve upon single frame predictions [38, 3]
  • This diversity of assumptions and additional information sources exemplifies the challenge presented by the task.

    [Figure: Image → ConvNet → Volumetric Output]
  • The authors show that ConvNets are able to provide much richer information than 2D joint locations
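To make the volumetric output concrete, here is a minimal sketch (our own illustration, not the authors' code) of the target the paper proposes: for each joint, a likelihood over a discretized 3D grid around the subject, peaked at the joint's voxel. The grid size and the Gaussian width sigma are illustrative assumptions.

    import numpy as np

    def volumetric_target(joints_xyz, grid=(64, 64, 64), sigma=2.0):
        """Build one per-voxel likelihood volume per joint.

        joints_xyz: (N, 3) joint positions already mapped to voxel coordinates.
        """
        W, H, D = grid
        xs, ys, zs = np.meshgrid(np.arange(W), np.arange(H), np.arange(D),
                                 indexing="ij")
        volumes = np.empty((len(joints_xyz), W, H, D), dtype=np.float32)
        for i, (jx, jy, jz) in enumerate(joints_xyz):
            d2 = (xs - jx) ** 2 + (ys - jy) ** 2 + (zs - jz) ** 2
            volumes[i] = np.exp(-d2 / (2.0 * sigma ** 2))  # 3D Gaussian bump
        return volumes

At test time, the estimate for each joint is simply its volume's highest-likelihood voxel, mapped back to metric coordinates.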
Highlights
  • Estimating the full-body 3D pose of a human from a single monocular image is an open challenge, which has garnered significant attention since the early days of computer vision [18]
  • Given its ill-posed nature, researchers have generally approached 3D human pose estimation in simplified settings, such as assuming background subtraction is feasible [1], relying on groundtruth 2D joint locations to estimate 3D pose [26, 43], employing additional camera views [7, 15], and capitalizing on temporal consistency to improve upon single frame predictions [38, 3]
  • Convolutional Networks (ConvNets) are used only for 2D joint localization, and 3D poses are generated during a post-processing optimization step
  • Our proposed approach achieves state-of-the-art results on standard benchmarks, surpassing both ConvNet-only and hybrid approaches that employ ConvNets for 2D pose estimation, with a relative error reduction that exceeds 30% on average
  • Our paper addressed the challenging problem of 3D human pose estimation from a single color image
  • Departing from recent ConvNet approaches, we cast the problem as 3D keypoint localization in a discretized space around the subject. We integrated this volumetric representation with a coarse-to-fine supervision scheme to deal with the high dimensionality and enable iterative processing
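The coarse-to-fine scheme in the last bullet can be sketched as follows: each stacked hourglass component is supervised with a target volume whose z-resolution grows stage by stage while the x-y resolution stays fixed, keeping early stages cheap and letting the final stage refine depth. The schedule and the pooling used below are assumptions for illustration (cf. the Li columns of Table 2), not the authors' exact code.

    import numpy as np

    def coarsen_z(target, z_bins):
        """Reduce a (N, W, H, D) volume to z_bins depth slices by max-pooling along z."""
        n, w, h, d = target.shape
        assert d % z_bins == 0
        return target.reshape(n, w, h, z_bins, d // z_bins).max(axis=-1)

    z_schedule = [1, 2, 4, 64]              # hypothetical per-stage z-resolutions
    finest = np.random.rand(17, 64, 64, 64) # stand-in for the finest-level target
    stage_targets = [coarsen_z(finest, z) for z in z_schedule]
    # z = 1 degenerates to a 2D heatmap; the final stage sees the full volume.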
Results
  • For Human3.6M, most approaches report the per joint 3D error, which is the average Euclidean distance of the estimated joints to the groundtruth
  • This is done after aligning the root joints of the estimated and groundtruth 3D pose.
  • An alternative metric, which is used by some methods to report results on Human3.6M and HumanEva-I, is the reconstruction error
  • It is defined as the per joint 3D error up to a similarity transformation.
  • The root joints are aligned to resolve the depth ambiguity
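The two metrics above can be stated precisely with a short sketch (variable names are ours; poses are (N, 3) arrays in millimetres with the root joint at index 0):

    import numpy as np

    def per_joint_error(pred, gt, root=0):
        """Mean per joint 3D error after aligning the root joints."""
        pred = pred - pred[root]
        gt = gt - gt[root]
        return np.linalg.norm(pred - gt, axis=1).mean()

    def reconstruction_error(pred, gt):
        """Per joint 3D error up to a similarity transform (Procrustes alignment)."""
        p, g = pred - pred.mean(0), gt - gt.mean(0)
        U, s, Vt = np.linalg.svd(p.T @ g)
        R = (U @ Vt).T                      # optimal rotation (reflection ignored)
        scale = s.sum() / (p ** 2).sum()    # optimal isotropic scale
        aligned = scale * p @ R.T + gt.mean(0)
        return np.linalg.norm(aligned - gt, axis=1).mean()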
Conclusion
  • The authors addressed the challenging problem of 3D human pose estimation from a single color image.
  • Departing from recent ConvNet approaches, the authors cast the problem as 3D keypoint localization in a discretized space around the subject.
  • The authors integrated this volumetric representation with a coarse-to-fine supervision scheme to deal with the high dimensionality and enable iterative processing.
  • The authors used the volumetric representation within a decoupled architecture, making it of practical use for in-the-wild images even when end-to-end training is not feasible.
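A hedged sketch of that decoupled variant: a 2D pose network trained on in-the-wild images produces per-joint heatmaps, and a second network trained purely on 3D mocap data lifts the heatmaps to volumetric predictions, so no image with paired 3D groundtruth is needed. Both net_2d and net_3d are hypothetical placeholders, not the paper's architectures.

    import numpy as np

    def predict_3d_decoupled(image, net_2d, net_3d):
        heatmaps = net_2d(image)    # (N, W, H) per-joint 2D heatmaps, trained in the wild
        volumes = net_3d(heatmaps)  # (N, W, H, D) likelihoods, trained on mocap pairs
        n = volumes.shape[0]
        flat = volumes.reshape(n, -1).argmax(axis=1)  # peak voxel per joint
        return np.stack(np.unravel_index(flat, volumes.shape[1:]), axis=1)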
Objectives
  • The authors' goal is to demonstrate the benefit of predicting the 3D pose directly from image features versus using 2D locations as an intermediate representation.
Tables
  • Table 1: Coordinate versus volume regression on Human3.6M. The mean per joint error (mm) across all actions and subjects in the test set is shown
  • Table 2: Comparison of the Naive Stacking (left) versus Coarse-to-Fine (right) approaches on Human3.6M. The column Li denotes the z-dimension resolution for the supervision provided at the i-th hourglass component (empty if the network has fewer than i components). We report mean per joint errors (mm) following the standard protocol
  • Table 3: Comparison of our coarse-to-fine network using 2D heatmaps for intermediate supervision (Coarse-to-Fine) versus a decoupled network with a volumetric representation (Decoupled). The reported results are for the six classes of Human3.6M with the largest difference between the two approaches, as well as the average across all actions
  • Table 4: Quantitative comparison on Human3.6M. The numbers are the average 3D joint error (mm). Baseline numbers are taken from the respective papers. Note, several approaches use video for prediction rather than a single frame [36, 45, 10]
  • Table 5: Quantitative comparison on Human3.6M among approaches that report reconstruction error (mm). Baseline numbers are taken from the respective papers
  • Table 6: Quantitative results on HumanEva-I. The numbers are the mean reconstruction errors (mm). Baseline numbers are taken from the respective papers
  • Table 7: Quantitative results on KTH Football II. The numbers are the mean PCP scores (the higher the better). Baseline numbers are taken from the respective papers. We indicate how many cameras each approach uses, and highlight the best performance for single-view approaches
Related work
  • The literature on 3D human pose estimation is vast with approaches addressing the problem in a variety of settings. Here, we survey works that are most relevant to ours with a focus on ConvNet-based approaches; we refer the reader to a recent survey [29] for a more complete literature review.

    The majority of recent ConvNet-only approaches cast 3D pose estimation as a coordinate regression task, with the target output being the spatial x, y, z coordinates of the human joints with respect to a known root joint, such as the pelvis. Li and Chan [19] pretrain their network with maps for 2D joint classification. Tekin et al [35] include a pretrained autoencoder within the network to enforce structural constraints on the output. Ghezelghieh et al [13] employ viewpoint prediction as a side task to provide the network with global joint configuration information. Zhou et al [44] embed a kinematic model to guarantee the validity of the regressed pose. Park et al [22] concatenate the 2D joint predictions with image features to improve 3D joint localization. Tekin et al [36] include temporal information in the joint predictions by extracting spatiotemporal features from a sequence of frames. In contrast to all these approaches, we adopt a volumetric representation of the human pose, and regress the per voxel likelihood for each joint separately. This proves to have significant advantages for the network performance and provides a richer output compared to the low-dimensional vector of joint coordinates.
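To illustrate why the per-voxel output is richer than a coordinate vector, the sketch below (with assumed shapes) shows that beyond the point estimate, the volume also exposes the network's spatial uncertainty, for example a depth distribution per joint, which a plain (x, y, z) regression output cannot provide:

    import numpy as np

    volumes = np.random.rand(17, 64, 64, 64)  # stand-in for predicted likelihoods

    # Point estimate: the most likely voxel of each joint.
    flat = volumes.reshape(17, -1).argmax(axis=1)
    joints = np.stack(np.unravel_index(flat, (64, 64, 64)), axis=1)  # (17, 3)

    # Richer readout: a normalized depth (z) marginal per joint.
    z_marginal = volumes.sum(axis=(1, 2))
    z_marginal /= z_marginal.sum(axis=1, keepdims=True)  # shape (17, 64)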
Funding
  • We gratefully acknowledge support through the following grants: NSF-DGE-0966142 (IGERT), NSF-IIP-1439681 (I/UCRC), NSF-IIS-1426840, ARL MAST-CTA W911NF-08-2-0004, ARL RCTA W911NF-10-2-0016, ONR N00014-17-1-2093, an ONR STTR (Robotics Research), NSERC Discovery, and the DARPA FLA program.
Study subjects and analysis
subjects: 11
Additionally, qualitative results are presented on the MPII human pose dataset [2], since no 3D groundtruth is available. Human3.6M contains video of 11 subjects performing a variety of actions, such as “Walking”, “Sitting”, and “Phoning”.

References
  • [1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. PAMI, 28(1):44–58, 2006.
  • [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  • [3] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010.
  • [4] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3D pictorial structures for multiple human pose estimation. In CVPR, 2013.
  • [5] L. Bo and C. Sminchisescu. Twin Gaussian processes for structured prediction. IJCV, 87(1-2):28–52, 2010.
  • [6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
  • [7] M. Burenius, J. Sullivan, and S. Carlsson. 3D pictorial structures for multiple view articulated pose estimation. In CVPR, 2013.
  • [8] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.
  • [9] W. Chen, H. Wang, Y. Li, H. Su, D. Lischinski, D. Cohen-Or, B. Chen, et al. Synthesizing training images for boosting human 3D pose estimation. In 3DV, 2016.
  • [10] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, and W. Geng. Marker-less 3D human motion capture with monocular image sequence and height-maps. In ECCV, 2016.
  • [11] A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt. Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras. In CVPR, 2015.
  • [12] J. Gall, B. Rosenhahn, T. Brox, and H.-P. Seidel. Optimization and filtering for human motion capture. IJCV, 87(1):75–92, 2010.
  • [13] M. F. Ghezelghieh, R. Kasturi, and S. Sarkar. Learning camera viewpoint using CNN to improve 3D body pose estimation. In 3DV, 2016.
  • [14] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 36(7):1325–1339, 2014.
  • [15] V. Kazemi, M. Burenius, H. Azizpour, and J. Sullivan. Multi-view body part recognition with random forests. In BMVC, 2013.
  • [16] I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3D human pose from images. In BMVC, 2014.
  • [17] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
  • [18] H.-J. Lee and Z. Chen. Determination of 3D human body postures from a single view. CVGIP, 30(2):148–168, 1985.
  • [19] S. Li and A. B. Chan. 3D human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
  • [20] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3D human pose estimation. In ICCV, 2015.
  • [21] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [22] S. Park, J. Hwang, and N. Kwak. 3D human pose estimation using convolutional neural networks with 2D pose information. In ECCVW, 2016.
  • [23] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. In CVPR, 2017.
  • [24] T. Pfister, J. Charles, and A. Zisserman. Flowing ConvNets for human pose estimation in videos. In ICCV, 2015.
  • [25] I. Radwan, A. Dhall, and R. Goecke. Monocular image 3D human pose estimation under self-occlusion. In ICCV, 2013.
  • [26] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D human pose from 2D image landmarks. In ECCV, 2012.
  • [27] G. Rogez and C. Schmid. MoCap-guided data augmentation for 3D pose estimation in the wild. In NIPS, 2016.
  • [28] M. Sanzari, V. Ntouskos, and F. Pirri. Bayesian image based 3D pose estimation. In ECCV, 2016.
  • [29] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris. 3D human pose estimation: A review of the literature and analysis of covariates. CVIU, 152:1–20, 2016.
  • [30] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1-2):4–27, 2010.
  • [31] L. Sigal, M. Isard, H. W. Haussecker, and M. J. Black. Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. IJCV, 98(1):15–48, 2012.
  • [32] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2D and 3D pose estimation from a single image. In CVPR, 2013.
  • [33] E. Simo-Serra, A. Ramisa, G. Alenya, C. Torras, and F. Moreno-Noguer. Single image 3D human pose estimation from noisy observations. In CVPR, 2012.
  • [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [35] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3D human pose with deep neural networks. In BMVC, 2016.
  • [36] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In CVPR, 2016.
  • [37] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
  • [38] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In CVPR, 2006.
  • [39] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3D human poses from a single image. In CVPR, 2014.
  • [40] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [41] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. In ECCV, 2016.
  • [42] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3D pose estimation from a single image. In CVPR, 2016.
  • [43] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis. 3D shape estimation from 2D landmarks: A convex relaxation approach. In CVPR, 2015.
  • [44] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. In ECCVW, 2016.
  • [45] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR, 2016.