View-Invariant Probabilistic Embedding for Human Pose

European Conference on Computer Vision (ECCV), pp. 53–70, 2020


Abstract

Depictions of similar human body configurations can vary with changing viewpoints. Using only 2D information, we would like to enable vision algorithms to recognize similarity in human body poses across multiple views. This ability is useful for analyzing body movements and human behaviors in images and videos. In this paper, we propose...

Introduction
  • When three-dimensional (3D) human bodies are represented in two dimensions (2D), the same human pose can appear different across camera views.
  • A change in viewpoint can cause significant visual variation due to the changing relative depth of body parts and self-occlusions.
  • Despite these variations, humans can recognize similar 3D human body poses in images and videos.
  • This ability is useful for computer vision tasks in which changing the viewpoint should not change the labels of the task.
Highlights
  • When we represent three-dimensional (3D) human bodies in two dimensions (2D), the same human pose can appear different across camera views
  • As illustrated in Figure 1, we explore whether view invariance of human bodies can be achieved from 2D poses alone, without predicting 3D pose
  • We introduce Probabilistic View-Invariant Pose Embeddings (Pr-VIPE), an approach to learning probabilistic view-invariant embeddings from 2D pose keypoints (a minimal sketch of the idea follows this list)
  • Our experiments suggest that input 2D keypoints alone are sufficient to achieve view-invariant properties in the embedding space, without having to explicitly predict 3D pose
  • We demonstrate that our probabilistic embeddings learn to capture input ambiguity, which can be useful for measuring uncertainty in downstream tasks
  • Pr-VIPE is compact with a simple architecture, and in addition to cross-view retrieval, our embeddings can be applied to other human-pose-related tasks
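
The probabilistic embedding highlighted above can be pictured as mapping a flattened 2D keypoint vector to a Gaussian in the embedding space, and scoring two poses by the probability that samples drawn from their Gaussians match. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the layer sizes, the 13-keypoint input, and the sigmoid-of-distance matching probability (in the spirit of hedged instance embeddings [29]) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ProbabilisticPoseEmbedder(nn.Module):
        """Sketch: map flattened 2D keypoints to a diagonal Gaussian (mean +
        log-variance) in a low-dimensional embedding space. Layer sizes and
        names are assumptions; only the 16-D embedding matches the paper."""

        def __init__(self, num_keypoints=13, embed_dim=16, hidden=1024):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(num_keypoints * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.mean_head = nn.Linear(hidden, embed_dim)
            self.logvar_head = nn.Linear(hidden, embed_dim)
            # Learned scalars for turning sample distance into a match
            # probability, in the spirit of hedged instance embeddings [29].
            self.a = nn.Parameter(torch.tensor(1.0))
            self.b = nn.Parameter(torch.tensor(0.0))

        def forward(self, keypoints_2d):
            # keypoints_2d: (batch, num_keypoints, 2)
            h = self.backbone(keypoints_2d.flatten(1))
            return self.mean_head(h), self.logvar_head(h)

        def matching_probability(self, x1, x2, num_samples=20):
            """Monte-Carlo estimate of P(match) between two batches of 2D poses."""
            mu1, logvar1 = self.forward(x1)
            mu2, logvar2 = self.forward(x2)
            std1, std2 = (0.5 * logvar1).exp(), (0.5 * logvar2).exp()
            probs = []
            for _ in range(num_samples):
                z1 = mu1 + std1 * torch.randn_like(std1)
                z2 = mu2 + std2 * torch.randn_like(std2)
                d = (z1 - z2).pow(2).sum(dim=-1).sqrt()
                probs.append(torch.sigmoid(-self.a * d + self.b))
            return torch.stack(probs).mean(dim=0)

A low matching probability between a query pose and its nearest neighbor is what the results below refer to as low retrieval confidence.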
Methods
  • The authors demonstrate the performance of the model through pose retrieval across different camera views (see the retrieval sketch after this list).
  • The authors iterate through all camera pairs in the dataset as query and index.
  • The authors train on a subset of the Human3.6M [14] dataset.
  • The authors present quantitative and qualitative results on the Human3.6M hold-out set and on another dataset (MPI-INF-3DHP [27]) unseen during training.
  • The authors present qualitative results on MPII Human Pose [2], for which 3D ground truth is not available.
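
The retrieval protocol in these bullets reduces to nearest-neighbor search in the embedding space, averaged over all ordered (query, index) camera pairs. Below is a small NumPy illustration of Hit@1 under that protocol; the function names and the Boolean is_match matrix (e.g., derived from an NP-MPJPE threshold, see the Tables section) are assumptions, not the paper's evaluation code.

    import numpy as np

    def hit_at_1(query_emb, index_emb, is_match):
        """Cross-view pose retrieval accuracy (Hit@1).

        query_emb: (N, D) embeddings of poses seen from the query camera.
        index_emb: (M, D) embeddings of the same scenes from the index camera.
        is_match:  (N, M) Boolean matrix, True where a query/index pair is
                   considered the same pose (e.g., NP-MPJPE below a threshold).
        """
        # Pairwise Euclidean distances between query and index embeddings.
        d = np.linalg.norm(query_emb[:, None, :] - index_emb[None, :, :], axis=-1)
        nearest = d.argmin(axis=1)  # top-1 retrieval per query
        return is_match[np.arange(len(query_emb)), nearest].mean()

    def cross_view_hit_at_1(embeddings_by_camera, is_match_fn):
        """Average Hit@1 over all ordered (query, index) camera pairs."""
        cams = list(embeddings_by_camera)
        scores = []
        for q in cams:
            for i in cams:
                if q == i:
                    continue
                scores.append(hit_at_1(embeddings_by_camera[q],
                                       embeddings_by_camera[i],
                                       is_match_fn(q, i)))
        return float(np.mean(scores))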
Results
  • In the second and third rows of the qualitative retrieval figure, the retrieval confidence is lower for 3DHP.
  • This is likely because 3DHP contains poses and views unseen during training, which places the nearest neighbor slightly farther away in the embedding space.
  • The rightmost pair in row 2 shows that the model can retrieve poses with large differences in roll angle, which are not present in the training set.
  • The results suggest that the performance of existing 2D keypoint detectors, such as [32], is sufficient to train pose embedding models that achieve the view-invariant property on diverse images.
Conclusion
  • The authors introduce Pr-VIPE, an approach to learning probabilistic view-invariant embeddings from 2D pose keypoints.
  • Their experiments suggest that input 2D keypoints alone are sufficient to achieve view-invariant properties in the embedding space, without having to explicitly predict 3D pose.
  • The authors demonstrate that the probabilistic embeddings learn to capture input ambiguity, which can be useful for measuring uncertainty in downstream tasks.
  • Pr-VIPE is compact with a simple architecture, and in addition to cross-view retrieval, the embeddings can be applied to other human-pose-related tasks.
  • The authors hope that this work can contribute to future studies in recognizing human poses and body motions.
Tables
  • Table 1: Comparison of cross-view pose retrieval results on H3.6M. ∗ indicates that normalization and Procrustes alignment are performed on query-index pairs.
  • Table 2: Comparison of cross-view pose retrieval results on 3DHP with chest-level cameras and with all cameras. ∗ indicates that normalization and Procrustes alignment are performed on query-index pairs.
  • Table 3: Additional ablation study results of Pr-VIPE on H3.6M over the number of samples K and the margin parameter β.
  • Table 4: Additional ablation study results of Pr-VIPE on H3.6M and 3DHP using different rotation thresholds for keypoint augmentation. The angle threshold for azimuth is always ±180°, and the angle thresholds in the table are for elevation and roll. The row "w/o aug." corresponds to Pr-VIPE without augmentation.
  • Table 5: Additional ablation study results of Pr-VIPE on H3.6M with different NP-MPJPE thresholds κ for training and evaluation (a sketch of the NP-MPJPE metric follows this list).
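
Tables 1, 2, and 5 rely on pose normalization, Procrustes alignment, and an NP-MPJPE threshold κ to decide whether a retrieved pose counts as a match. The sketch below shows one plausible reading of such a normalized, Procrustes-aligned MPJPE; the specific normalization used here (centering plus unit Frobenius norm) is an assumption and may differ from the authors' definition.

    import numpy as np

    def np_mpjpe(pose_a, pose_b):
        """Normalized, Procrustes-aligned MPJPE between two (J, 3) poses.

        Sketch only: both poses are centered and scaled to unit Frobenius
        norm (an assumed normalization), then pose_a is rotated onto pose_b
        with the orthogonal Procrustes solution before averaging per-joint
        errors.
        """
        def normalize(p):
            p = p - p.mean(axis=0, keepdims=True)
            return p / np.linalg.norm(p)

        a, b = normalize(pose_a), normalize(pose_b)
        # Orthogonal Procrustes: rotation R minimizing ||a @ R - b||_F.
        u, _, vt = np.linalg.svd(a.T @ b)
        r = u @ vt
        if np.linalg.det(r) < 0:  # avoid reflections
            u[:, -1] *= -1
            r = u @ vt
        return float(np.linalg.norm(a @ r - b, axis=1).mean())

    # Two poses "match" for retrieval evaluation if np_mpjpe(a, b) <= kappa,
    # e.g. kappa = 0.1 as in the confidence analysis below.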
Related work
  • Metric Learning. We aim to understand similarity in human poses across views. Most works that capture similarity between inputs apply techniques from metric learning. Objectives such as contrastive loss (based on pair matching) [4, 9, 29] and triplet loss (based on tuple ranking) [45, 40, 46, 10] are often used to push similar examples together and pull dissimilar examples apart in the embedding space.

    The number of possible training tuples grows exponentially with the number of samples per tuple, and not all combinations are equally informative. To find informative training tuples, various mining strategies have been proposed [40, 47, 30, 10]. In particular, semi-hard triplet mining has been widely used [40, 47, 33]; it selects negative examples that are hard enough to be informative but not so hard that they destabilize training, where the hardness of a negative is measured by its embedding distance to the anchor (a mining sketch follows this paragraph). This distance is commonly the Euclidean distance [45, 46, 40, 10], but any differentiable distance function can be used [10]. [13, 15] show that alternative distance metrics also work for image and object retrieval.
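
Because the paragraph above leans on semi-hard triplet mining, here is a compact sketch of the standard selection rule from FaceNet-style training [40]: for each anchor-positive pair, prefer a negative that is farther from the anchor than the positive but still within the margin. This is a generic PyTorch illustration, not the authors' training code; the batch layout, margin value, and fallback rule are assumptions.

    import torch

    def semi_hard_negative_indices(dist, pos_idx, margin=0.2):
        """Pick a semi-hard negative for each anchor in a batch.

        dist:    (N, N) pairwise embedding distances within the batch.
        pos_idx: (N,) index of each anchor's positive example.
        Returns a (N,) tensor of negative indices satisfying
        d(a, p) < d(a, n) < d(a, p) + margin when such a negative exists,
        otherwise the closest non-positive example.
        """
        n = dist.size(0)
        arange = torch.arange(n)
        pos_dist = dist[arange, pos_idx].unsqueeze(1)               # d(a, p)
        semi_hard = (dist > pos_dist) & (dist < pos_dist + margin)  # candidates
        semi_hard[arange, pos_idx] = False                          # exclude positive
        semi_hard[arange, arange] = False                           # exclude anchor

        # Fallback: the closest non-positive example when no candidate exists.
        masked = dist.clone()
        masked[arange, pos_idx] = float("inf")
        masked[arange, arange] = float("inf")
        fallback = masked.argmin(dim=1)

        # Among semi-hard candidates, take the closest one.
        inf = torch.full_like(dist, float("inf"))
        candidate = torch.where(semi_hard, dist, inf).argmin(dim=1)
        return torch.where(semi_hard.any(dim=1), candidate, fallback)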
Quantitative results
  • Pr-VIPE achieves higher accuracy than the lifting baseline on both datasets with 16 embedding dimensions.
  • For the baseline lifting model in the camera frame, we achieve 55.5% Hit@1 on H3.6M, 30.6% on 3DHP (all cameras), and 25.9% on 3DHP (chest-level cameras).
  • For Pr-VIPE, we achieve 97.5% Hit@1 on H3.6M, 44.3% on 3DHP (all cameras), and 66.4% on 3DHP (chest-level cameras).
Study subjects and analysis
query-retrieval sample pairs: 6,000 (H3.6M), 55,000 (3DHP)
The evaluation procedure forms 6,000 query-retrieval sample pairs for H3.6M (4 views, 12 camera pairs) and 55,000 for 3DHP (11 views, 110 camera pairs), which are further binned by retrieval confidence. Figure 8 shows the matching accuracy under κ = 0.1 for each confidence bin (a binning sketch follows).
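
A small sketch of the bookkeeping described above, assuming confidences holds the matching probability of each query's top-1 retrieval and correct marks whether that retrieval falls within the κ = 0.1 NP-MPJPE threshold (both names are illustrative):

    import numpy as np

    def accuracy_per_confidence_bin(confidences, correct, num_bins=10):
        """Bin retrievals by confidence and report matching accuracy per bin.

        confidences: (N,) retrieval confidences in [0, 1], e.g. the matching
                     probability of each query's nearest neighbor.
        correct:     (N,) Boolean, True if the retrieval is a match under the
                     NP-MPJPE threshold (kappa = 0.1 in the paper's analysis).
        """
        edges = np.linspace(0.0, 1.0, num_bins + 1)
        bins = np.clip(np.digitize(confidences, edges) - 1, 0, num_bins - 1)
        return {
            (round(edges[b], 2), round(edges[b + 1], 2)):
                float(correct[bins == b].mean())
            for b in range(num_bins)
            if np.any(bins == b)
        }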

References
  • [1] Ijaz Akhter and Michael J Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
  • [2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
  • [3] Aleksandar Bojchevski and Stephan Gunnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815, 2017.
  • [4] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1992.
  • [5] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7035–7043, 2017.
  • [6] Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dylan Drover, Stefan Stojanov, and James M Rehg. Unsupervised 3d pose estimation with geometric self-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5714–5724, 2019.
  • [7] Ruihang Chu, Yifan Sun, Yadong Li, Zheng Liu, Chi Zhang, and Yichen Wei. Vehicle re-identification with viewpoint-aware metric learning. arXiv preprint arXiv:1910.04104, 2019.
  • [8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [9] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006.
  • [10] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [11] Chih-Hui Ho, Pedro Morgado, Amir Persekian, and Nuno Vasconcelos. Pies: Pose invariant embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12377–12386, 2019.
  • [12] Wenze Hu and Song-Chun Zhu. Learning a probabilistic model mixing 3d and 2d primitives for view invariant object recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2273–2280. IEEE, 2010.
  • [13] Chen Huang, Chen Change Loy, and Xiaoou Tang. Local similarity-aware deep feature embedding. In Advances in Neural Information Processing Systems, pages 1262–1270, 2016.
  • [14] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
  • [15] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Mining on manifolds: Metric learning without labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7642–7651, 2018.
  • [16] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. arXiv preprint arXiv:1905.05754, 2019.
  • [17] Xiaofei Ji and Honghai Liu. Advances in view-invariant human motion analysis: a review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1):13–24, 2009.
  • [18] Xiaofei Ji, Honghai Liu, Yibo Li, and David Brown. Visual-based view-invariant human motion analysis: A review. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 741–748. Springer, 2008.
  • [19] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
  • [20] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [21] Muhammed Kocabas, Salih Karagoz, and Emre Akbas. Self-supervised learning of 3d human pose using multi-view geometry. arXiv preprint arXiv:1903.02330, 2019.
  • [22] Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97–104. Citeseer, 2004.
  • [23] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan Kankanhalli. Unsupervised learning of view-invariant action representations. In Advances in Neural Information Processing Systems, pages 1254–1264, 2018.
  • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • [25] Jian Liu, Naveed Akhtar, and Mian Ajmal. Viewpoint invariant action recognition using rgb-d videos. IEEE Access, 6:70061–70071, 2018.
  • [26] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
  • [27] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV), pages 506–516. IEEE, 2017.
  • [28] Greg Mori, Caroline Pantofaru, Nisarg Kothari, Thomas Leung, George Toderici, Alexander Toshev, and Weilong Yang. Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302, 2015.
  • [29] Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, and Andrew Gallagher. Modeling uncertainty with hedged instance embedding. arXiv preprint arXiv:1810.00319, 2018.
  • [30] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
  • [31] Eng-Jon Ong, Antonio S Micilotta, Richard Bowden, and Adrian Hilton. Viewpoint invariant exemplar-based 3d human tracking. Computer Vision and Image Understanding, 104(2-3):178–189, 2006.
  • [32] George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, and Kevin Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pages 269–286, 2018.
  • [33] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
  • [34] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
  • [35] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3d human pose estimation. arXiv preprint arXiv:1909.01203, 2019.
  • [36] Cen Rao and Mubarak Shah. View-invariance in action recognition. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 2, pages II–II. IEEE, 2001.
  • [37] Mir Rayat Imtiaz Hossain and James J Little. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–84, 2018.
  • [38] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 750–767, 2018.
  • [39] Helge Rhodin, Jorg Sporri, Isinsu Katircioglu, Victor Constantin, Frederic Meyer, Erich Muller, Mathieu Salzmann, and Pascal Fua. Learning monocular 3d human pose estimation from multi-view images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8437–8446, 2018.
  • [40] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • [41] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.
  • [42] Bugra Tekin, Pablo Marquez-Neila, Mathieu Salzmann, and Pascal Fua. Learning to fuse 2d and 3d image cues for monocular body pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3941–3950, 2017.
  • [43] Denis Tome, Matteo Toso, Lourdes Agapito, and Chris Russell. Rethinking pose in 3d: Multi-stage refinement and recovery for markerless motion capture. In 2018 International Conference on 3D Vision (3DV), pages 474–483. IEEE, 2018.
  • [44] Luke Vilnis and Andrew McCallum. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623, 2014.
  • [45] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
  • [46] Paul Wohlhart and Vincent Lepetit. Learning descriptors for object recognition and 3d pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3109–3118, 2015.
  • [47] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
  • [48] Lu Xia, Chia-Chih Chen, and Jake K Aggarwal. View invariant human action recognition using histograms of 3d joints. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 20–27. IEEE, 2012.
  • [49] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 2019.
  • [50] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 398–407, 2017.