Stacked Hourglass Networks For Human Pose Estimation

Computer Vision - ECCV 2016, Part VIII, pp. 483-499 (2016)


Abstract

This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision ...
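The repeated bottom-up, top-down processing described in the abstract can be illustrated with a small structural sketch. It assumes single-channel feature maps stored as nested lists whose sides are divisible by 2**depth, and it replaces the learned convolutional layers with a hypothetical identity function `block`, so it shows only the data flow: max-pool down, recurse, upsample by nearest neighbour, and add a skip branch kept at each resolution. It is not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]


def max_pool2(x: Grid) -> Grid:
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def upsample2(x: Grid) -> Grid:
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two same-sized maps."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def block(x: Grid) -> Grid:
    """Hypothetical stand-in for the learned layers: identity (copy) here."""
    return [row[:] for row in x]


def hourglass(x: Grid, depth: int) -> Grid:
    """Bottom-up pooling then top-down upsampling, with a skip branch per scale."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    skip = block(x)                    # branch kept at the current resolution
    low = block(max_pool2(x))          # bottom-up: halve the resolution
    if depth > 1:
        low = hourglass(low, depth - 1)
    up = upsample2(block(low))         # top-down: back to the current resolution
    return add(skip, up)               # consolidate features across scales
```

For example, `hourglass(x, 2)` on a 4x4 map returns a 4x4 map in which each pixel combines its full-resolution value with max-pooled context from the 2x2 and 1x1 scales; the output resolution always matches the input, which is what lets hourglasses be stacked end to end.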

Introduction
  • A key step toward understanding people in images and video is accurate pose estimation.
  • A good pose estimation system must be robust to occlusion and severe deformation, successful on rare and novel poses, and invariant to changes in appearance due to factors like clothing and lighting.
  • Prior work tackles these difficulties using robust image features and sophisticated structured prediction [1,2,3,4,5,6,7,8,9]: the former is used to produce local interpretations, whereas the latter is used to infer a globally consistent pose.
Highlights
  • Our work focuses solely on the task of keypoint localization of a single person’s pose from an RGB image
  • We explore several options for layer design in our network
  • We evaluate our network on two benchmark datasets, FLIC [1] and MPII Human Pose [21]
  • We demonstrate the effectiveness of a stacked hourglass network for producing human pose estimates
Conclusion
  • The authors demonstrate the effectiveness of a stacked hourglass network for producing human pose estimates.
  • The network handles a diverse and challenging set of poses with a simple mechanism for reevaluation and assessment of initial predictions.
  • Intermediate supervision is critical for training the network, working best in the context of stacked hourglass modules.
  • Some difficult cases are still not handled perfectly by the network, but overall the system is robust to a variety of challenges, including heavy occlusion and multiple people in close proximity.
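Intermediate supervision can be sketched as a loss summed over every hourglass in the stack, so each stage receives its own training signal rather than only the last one. The sketch assumes that each stack outputs one heatmap per joint, that every stack is compared with the same ground-truth heatmaps, and that the per-heatmap loss is mean squared error; the function names are hypothetical.

```python
from typing import List, Sequence

Heatmap = List[List[float]]


def mse(pred: Heatmap, target: Heatmap) -> float:
    """Mean squared error between two same-sized heatmaps."""
    total, count = 0.0, 0
    for pr, tr in zip(pred, target):
        for p, t in zip(pr, tr):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty heatmap")
    return total / count


def intermediate_supervision_loss(stack_preds: Sequence[Sequence[Heatmap]],
                                  targets: Sequence[Heatmap]) -> float:
    """Sum of per-stack losses: every hourglass's predicted heatmaps are
    compared with the same ground truth (assumption), one heatmap per joint."""
    loss = 0.0
    for preds in stack_preds:                     # one entry per hourglass
        for pred, target in zip(preds, targets):  # one heatmap per joint
            loss += mse(pred, target)
    return loss
```

With two stacks and one joint, a first stack that predicts an all-zero 2x2 map against a target with a single 1.0 contributes 0.25, and a second stack that matches the target exactly contributes 0, giving a total of 0.25.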
Tables
  • Table 1: FLIC results (PCK@0.2)
  • Table 2: Results on MPII Human Pose (PCKh@0.5)
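Both tables report a percentage of correct keypoints (PCK): a predicted joint counts as correct when its distance to the ground truth is within a fraction alpha of a per-person reference length. As these metrics are commonly defined, PCK@0.2 on FLIC uses alpha = 0.2 of the torso size, and PCKh@0.5 on MPII uses alpha = 0.5 of the head segment length. The helper below is a hypothetical sketch; it assumes the reference lengths are supplied by the caller and treats a distance exactly at the threshold as correct.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(preds: Sequence[Point], gts: Sequence[Point],
        ref_lengths: Sequence[float], alpha: float) -> float:
    """Fraction of joints predicted within alpha * reference length of the
    ground truth (torso size for FLIC PCK, head size for MPII PCKh)."""
    if not (len(preds) == len(gts) == len(ref_lengths)):
        raise ValueError("preds, gts and ref_lengths must have equal length")
    if not preds:
        raise ValueError("need at least one joint")
    hits = 0
    for (px, py), (gx, gy), ref in zip(preds, gts, ref_lengths):
        if math.hypot(px - gx, py - gy) <= alpha * ref:
            hits += 1
    return hits / len(preds)
```

For three joints at distances 0, 10 and 3 from the ground truth with a reference length of 10, `alpha=0.5` accepts two of them (2/3), while the stricter `alpha=0.2` accepts only the first (1/3).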
Related work
  • With the introduction of "DeepPose" by Toshev et al. [24], research on human pose estimation began to shift from classic approaches [1,2,3,4,5,6,7,8,9] to deep networks. Toshev et al. use their network to directly regress the x, y coordinates of joints. The work by Tompson et al. [15] instead generates heatmaps by running an image through multiple resolution banks in parallel to simultaneously capture features at a variety of scales. Our network design largely builds on their work, exploring how to capture information across scales and adapting their method for combining features across different resolutions.

    A critical feature of the method proposed by Tompson et al. [15] is the joint use of a ConvNet and a graphical model. Their graphical model learns typical ...
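The contrast drawn above between regressing x, y coordinates directly (DeepPose) and predicting heatmaps can be made concrete: with heatmaps, each joint's training target is an image that peaks at the joint, and a prediction is decoded by finding the highest-scoring pixel. This sketch assumes an unnormalised 2D Gaussian target with a hypothetical default sigma of 1 pixel and plain argmax decoding; actual systems may differ in both choices.

```python
import math
from typing import List, Tuple


def gaussian_heatmap(height: int, width: int, center: Tuple[float, float],
                     sigma: float = 1.0) -> List[List[float]]:
    """Target heatmap: an unnormalised 2D Gaussian (peak 1.0) at joint (x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def argmax_location(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Decode a heatmap back to (x, y) by taking its highest-scoring pixel."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best:
                best, best_xy = v, (x, y)
    return best_xy
```

Encoding a joint at (5, 2) on an 8x8 grid and decoding it again recovers (5, 2). The heatmap form gives the network a dense, spatially structured target instead of two numbers, at the cost of output resolution limiting localisation precision under plain argmax decoding.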
References
  • [1] Sapp, B., Taskar, B.: MODEC: multimodal decomposable models for human pose estimation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3674–3681. IEEE (2013)
  • [2] Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8. IEEE (2008)
  • [3] Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Strong appearance and expressive spatial models for human pose estimation. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 3487–3494. IEEE (2013)
  • [4] Bourdev, L., Malik, J.: Poselets: body part detectors trained using 3D human pose annotations. In: IEEE 12th International Conference on Computer Vision, 2009, pp. 1365–1372. IEEE (2009)
  • [5] Johnson, S., Everingham, M.: Learning effective human pose estimation from inaccurate annotation. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1465–1472. IEEE (2011)
  • [6] Ramanan, D.: Learning to parse images of articulated objects. In: Advances in Neural Information Processing Systems, p. 134 (2006)
  • [7] Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2878–2890 (2013)
  • [8] Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8. IEEE (2008)
  • [9] Ladicky, L., Torr, P.H., Zisserman, A.: Human pose estimation using a joint pixelwise and part-wise formulation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3578–3585. IEEE (2013)
  • [10] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
  • [11] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
  • [12] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
  • [13] Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (2015)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 (2016)
  • [15] Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems, pp. 1799–1807 (2014)
  • [16] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656 (2015)
  • [17] Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.: DeepCut: joint subset partition and labeling for multi person pose estimation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [18] Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [19] Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [20] Fan, X., Zheng, K., Lin, Y., Wang, S.: Combining local appearance and holistic view: dual-source deep neural networks for human pose estimation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1347–1355. IEEE (2015)
  • [21] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3686–3693. IEEE (2014)
  • [22] Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: Proceedings of the British Machine Vision Conference (2010). doi:10.5244/C.24.12
  • [23] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
  • [24] Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1653–1660. IEEE (2014)
  • [25] Chen, X., Yuille, A.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in Neural Information Processing Systems (NIPS) (2014)
  • [26] Ramakrishna, V., Munoz, D., Hebert, M., Bagnell, J.A., Sheikh, Y.: Pose machines: articulated pose estimation via inference machines. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part II. LNCS, vol. 8690, pp. 33–47. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10605-2_3
  • [27] Hu, P., Ramanan, D.: Bottom-up and top-down reasoning with hierarchical rectified Gaussians. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2016)
  • [28] Jain, A., Tompson, J., LeCun, Y., Bregler, C.: MoDeep: a deep learning framework using motion features for human pose estimation. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9004, pp. 302–315. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16808-1_21
  • [29] Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)
  • [30] Pfister, T., Charles, J., Zisserman, A.: Flowing ConvNets for human pose estimation in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1913–1921 (2015)
  • [31] Chen, X., Yuille, A.L.: Parsing occluded people by flexible compositions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3945–3954 (2015)
  • [32] Oliveira, G.L., Valada, A., Bollen, C., Burgard, W., Brox, T.: Deep learning for human part discovery in images. In: IEEE International Conference on Robotics and Automation (ICRA) (2016)
  • [33] Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015)
  • [34] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems, pp. 2366–2374 (2014)
  • [35] Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)
  • [36] Pinheiro, P., Collobert, R.: Recurrent convolutional neural networks for scene labeling. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 82–90 (2014)
  • [37] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658 (2015)
  • [38] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: International Conference on Learning Representations (ICLR) (2016)
  • [39] Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. In: International Conference on Learning Representations (ICLR) (2013)
  • [40] Bertasius, G., Shi, J., Torresani, L.: DeepEdge: a multi-scale bifurcated deep network for top-down contour detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4380–4389 (2015)
  • [41] Hariharan, B., Arbelaez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2015)
  • [42] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
  • [43] Zhao, J., Mathieu, M., Goroshin, R., LeCun, Y.: Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351 (2015)
  • [44] Rematas, K., Ritschel, T., Fritz, M., Gavves, E., Tuytelaars, T.: Deep reflectance maps. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015 (2015)
  • [45] Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
  • [46] Yang, J., Reed, S.E., Yang, M.H., Lee, H.: Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1099–1107. Curran Associates, Inc. (2015)
  • [47] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems, pp. 3546–3554 (2015)
  • [48] Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a Matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)
  • [49] Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning (2012)