Epipolar Transformers

CVPR 2020


Abstract

A common approach to localize 3D human joints in a synchronized and calibrated multi-view setup consists of two-steps: (1) apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is li...
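
To make the second step of this pipeline concrete, below is a minimal sketch of linear (DLT) triangulation of one joint from calibrated views. The function name and the plain SVD solution are illustrative assumptions; a robust triangulation would additionally weight or reject unreliable detections (e.g., by confidence or RANSAC), which is omitted here.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Linear (DLT) triangulation of one joint from several calibrated views.

    proj_mats: list of 3x4 camera projection matrices, one per view.
    points_2d: list of (x, y) detections of the same joint, one per view.
    Returns the 3D point minimizing the algebraic error over all views.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X:
        #   x * (P[2] @ X) - P[0] @ X = 0   and   y * (P[2] @ X) - P[1] @ X = 0
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```
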

Introduction
  • In order to estimate the 3D pose of a human body or hand, there are two common settings.
  • The second setting, which is the focus of the paper, is multi-view 3D pose estimation.
  • To this end, the authors propose the fully differentiable “epipolar transformer” module, which enables a 2D detector to gain access to 3D information in the intermediate layers of the 2D detector itself, and during the final robust triangulation phase.
  • The authors do not know where the correct p′ is, so to approximate the feature at p′, they first leverage the epipolar line generated by p in the source view to limit the potential locations of p′.
  • Note that the aforementioned operation is done densely for all locations in an intermediate feature map, so the final output of the module is a set of 3D-aware intermediate features with the same dimensions as the input feature map (a minimal sketch of this operation follows the list).
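
Below is a minimal PyTorch-style sketch of that dense operation, under the assumption that the epipolar-line sample coordinates have already been precomputed from the camera calibration. The number of samples K, the dot-product attention, and the residual 1x1-conv fusion are illustrative simplifications, not the paper's exact fusion module.

```python
import torch
import torch.nn.functional as F

def epipolar_fuse(ref_feat, src_feat, sample_grid, fuse_conv):
    """Sketch of the dense epipolar sampling-and-fusion step.

    ref_feat:    (B, C, H, W) intermediate features of the reference view.
    src_feat:    (B, C, H, W) intermediate features of a source view.
    sample_grid: (B, H*W, K, 2) normalized [-1, 1] coordinates of K points
                 sampled on the epipolar line of every reference pixel,
                 precomputed from the camera calibration.
    fuse_conv:   a 1x1 convolution mapping the matched feature back to C channels.
    Returns a 3D-aware feature map with the same (B, C, H, W) shape as the input.
    """
    B, C, H, W = ref_feat.shape
    # Sample K candidate features along each epipolar line in the source view.
    candidates = F.grid_sample(src_feat, sample_grid, align_corners=True)  # (B, C, H*W, K)

    ref = ref_feat.view(B, C, H * W, 1)
    # Similarity between the reference feature at p and every candidate on the line.
    attn = (ref * candidates).sum(dim=1, keepdim=True) / C ** 0.5          # (B, 1, H*W, K)
    attn = attn.softmax(dim=-1)
    # The weighted sum approximates the feature at the (unknown) true match p'.
    matched = (attn * candidates).sum(dim=-1).view(B, C, H, W)

    # Residual fusion keeps the output dimensions identical to the input.
    return ref_feat + fuse_conv(matched)
```

Because the output shape matches the input, such a module can be dropped between existing layers of a 2D detector, which is the property emphasized in the highlights below.
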
Highlights
  • In order to estimate the 3D pose of a human body or hand, there are two common settings
  • We propose the fully differentiable “epipolar transformer” module, which enables a 2D detector to gain access to 3D information in the intermediate layers of the 2D detector itself, and during the final robust triangulation phase
  • The epipolar transformer augments the intermediate features of a 2D detector for a given view with features from neighboring views, making the intermediate features 3D-aware as shown in Figure 1
  • To compute the 3D-aware intermediate feature at location p in the reference view, we first find the point corresponding to p in the source view, p′, and fuse the feature at p with the feature at p′ to get a 3D-aware feature (the epipolar-line construction used to search for p′ is sketched after this list)
  • We proposed the epipolar transformer, which enables 2D pose detectors to leverage 3D-aware features through fusing features along the epipolar lines of neighboring views
  • The epipolar transformer has very few learnable parameters and outputs features with the same dimension as the input, enabling it to be augmented to existing 2D pose estimation networks
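
The search for p′ is restricted to the epipolar line of p, which follows from standard two-view geometry [11]. A small numpy sketch, assuming (R, t) maps reference-camera coordinates to source-camera coordinates; the function name and argument layout are illustrative.

```python
import numpy as np

def epipolar_line(p_ref, K_ref, K_src, R, t):
    """Epipolar line in the source image induced by pixel p_ref in the reference image.

    K_ref, K_src: 3x3 intrinsic matrices of the two cameras.
    R, t:         rotation and translation mapping reference-camera coordinates
                  to source-camera coordinates (x_src = R @ x_ref + t).
    Returns (a, b, c) with a*x + b*y + c = 0 for every source pixel (x, y)
    that could correspond to p_ref.
    """
    # Cross product with t written as a matrix multiplication.
    t_cross = np.array([[0, -t[2], t[1]],
                        [t[2], 0, -t[0]],
                        [-t[1], t[0], 0]])
    # Fundamental matrix of the calibrated view pair [11].
    F = np.linalg.inv(K_src).T @ t_cross @ R @ np.linalg.inv(K_ref)
    p_h = np.array([p_ref[0], p_ref[1], 1.0])
    return F @ p_h
```
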
Methods
  • The authors have conducted the experiments on two large-scale pose estimation datasets with multi-view images and ground-truth 3D pose annotations: an internal hand dataset InterHand, and a publicly available human pose dataset Human3.6M [13].

    InterHand dataset: InterHand is an internal hand dataset that is captured in a synchronized multi-view studio with 34 color cameras and 46 monochrome cameras.
  • 21 keypoints were annotated per hand, so there are 42 unique points for two hands.
Results
  • Evaluation metric:

    During training, the authors use Mean Squared Error (MSE) between the predicted and ground truth heatmaps as the loss.
  • To estimate the accuracy of 3D pose prediction, the authors adopt the MPJPE (Mean Per Joint Position Error) metric.
  • It is one of the most popular evaluation metrics and is referred to as Protocol #1 in [21].
  • It is calculated as the average L2 distance between the ground-truth and predicted positions of each joint (see the short sketch after this list).
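
Both quantities reduce to a few lines. A sketch assuming heatmaps stacked as (N, J, H, W) arrays and 3D joints stacked as (N, J, 3) arrays in millimeters; the array layout is an assumption for illustration.

```python
import numpy as np

def heatmap_mse_loss(pred_heatmaps, gt_heatmaps):
    """Training loss: mean squared error between predicted and ground-truth heatmaps."""
    return np.mean((pred_heatmaps - gt_heatmaps) ** 2)

def mpjpe(pred_joints, gt_joints):
    """MPJPE (Protocol #1 [21]): mean L2 distance between predicted and
    ground-truth 3D joints, averaged over joints (and frames)."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()
```
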
Conclusion
  • The authors proposed the epipolar transformer, which enables 2D pose detectors to leverage 3D-aware features through fusing features along the epipolar lines of neighboring views.
  • Qualitative analysis of feature matching along the epipolar line shows that the epipolar transformer can provide more accurate matches in difficult scenarios with occlusions.
  • The epipolar transformer has very few learnable parameters and outputs features with the same dimension as the input, enabling it to be augmented to existing 2D pose estimation networks.
  • The authors believe that the epipolar transformer can benefit 3D vision tasks such as deep multi-view stereo [40].
Tables
  • Table 1: Architecture design comparison for both InterHand and Human3.6M.
  • Table 2: Epipolar transformer plugged into different stages.
  • Table 3: Different number of neighboring source views for inference.
  • Table 4: Comparison with cross-view fusion [28]. The baseline uses Hourglass networks [23] for InterHand and ResNet-50 [12] for Human3.6M [13], without view fusion.
  • Table 5
  • Table 6: Comparison with state-of-the-art methods on Human3.6M [13], where no additional training data is used unless specified. The metric is MPJPE (mm). "+": rotation and scaling augmentation. "-": models trained using released code [28], where the per-action MPJPE
  • Table 7: Comparison with state-of-the-art methods using external datasets on Human3.6M [13]. "+": data augmentation (i.e., cube rotation [15]). "err.": the error metric is MPJPE (mm). "tri.": triangulation. The number of parameters and MACs (multiply-add operations) are calculated using THOP.
Related Work
  • Multi-view 3D Human Pose Estimation: There are many methods proposed for multi-view human pose estimation. Pavllo et al. [26] proposed estimating 3D human pose in video via dilated temporal convolutions over 2D keypoints. Rhodin et al. [29] proposed to leverage multi-view constraints as weak supervision to enhance a monocular 3D human pose detector when labeled data is limited. Our method is most similar to Qiu et al. [28] and Iskakov et al. [15], so we provide a more detailed comparison in the following paragraphs.

    Qiu et al. [28] proposed to fuse features from other views by learning a fixed attention weight for every pair of pixels in each pair of views. The advantage of this method is that camera calibration is no longer required. However, the disadvantages are (1) more data from each view is needed to train the attention weights, (2) there are significantly more weights to learn as the number of views and the image resolution increase (a back-of-the-envelope sketch follows this paragraph), and (3) at test time, if the multi-camera setup changes, the attention learned during training is no longer applicable. On the other hand, although the proposed epipolar transformer relies on camera calibration, it adds only minimal learnable parameters. This makes it significantly easier to train and thus requires fewer training images per view (Table 4). Furthermore, a network trained with the epipolar transformer can be applied to an unseen multi-camera setup without additional training, as long as the calibration parameters are provided.
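
To make the scaling argument concrete, here is a back-of-the-envelope count of the fixed pairwise attention weights described above; the feature-map size and number of views are illustrative assumptions, not numbers from either paper.

```python
# Back-of-the-envelope count of fixed pairwise attention weights in a cross-view
# fusion scheme like [28]; H, W, V are illustrative, not taken from the papers.
H, W, V = 64, 64, 4
weights_per_view_pair = (H * W) ** 2          # one learned weight per pixel pair
ordered_view_pairs = V * (V - 1)              # fusion is done per ordered view pair
total = weights_per_view_pair * ordered_view_pairs
print(f"{total:,} learned attention weights")  # ~201 million for this setup
```
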
Contributions
  • Proposes the fully differentiable “epipolar transformer” module, which enables a 2D detector to gain access to 3D information in the intermediate layers of the 2D detector itself, and during the final robust triangulation phase
  • Explores the Bottleneck Embedded Gaussian architecture, which was popularized by non-local networks, as the feature fusion module, as shown in Figure 2 (a generic sketch of this block follows).
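
As a point of reference, below is a generic sketch of a bottleneck embedded Gaussian block in the style of non-local networks, applied between a reference-view and a source-view feature map. In the epipolar transformer the attention would be restricted to the samples on the epipolar line rather than the full source map, and the exact layer layout here (C/2 bottleneck, residual output) is an assumption.

```python
import torch
import torch.nn as nn

class BottleneckEmbeddedGaussianFusion(nn.Module):
    """Generic embedded-Gaussian fusion block with a channel bottleneck.

    theta/phi embed the reference and source features into a lower-dimensional
    space, their similarity weights the source values g, and a final 1x1 conv
    projects back to C channels before the residual connection.
    """

    def __init__(self, channels, bottleneck=None):
        super().__init__()
        bottleneck = bottleneck or channels // 2
        self.theta = nn.Conv2d(channels, bottleneck, 1)
        self.phi = nn.Conv2d(channels, bottleneck, 1)
        self.g = nn.Conv2d(channels, bottleneck, 1)
        self.out = nn.Conv2d(bottleneck, channels, 1)

    def forward(self, ref_feat, src_feat):
        B, C, H, W = ref_feat.shape
        q = self.theta(ref_feat).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(src_feat).flatten(2)                     # (B, C', HW)
        v = self.g(src_feat).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)                   # embedded Gaussian weights
        fused = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return ref_feat + self.out(fused)                      # same shape as the input
```
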
References
  • [1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [2] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3D hand shape and pose from images in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [3] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3D hand pose estimation from monocular RGB images. In Proceedings of the European Conference on Computer Vision, 2018.
  • [4] Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. Latent structured models for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 2011.
  • [5] Ricson Cheng, Ziyan Wang, and Katerina Fragkiadaki. Geometry-aware recurrent neural networks for active visual recognition. In Advances in Neural Information Processing Systems, 2018.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [7] Endri Dibra, Silvan Melchior, Ali Balkis, Thomas Wolf, Cengiz Oztireli, and Markus Gross. Monocular RGB hand pose inference from unsupervised refinable nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [8] Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [9] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [10] Liuhao Ge, Zhou Ren, and Junsong Yuan. Point-to-point regression PointNet for 3D hand pose estimation. In Proceedings of the European Conference on Computer Vision, 2018.
  • [11] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [13] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • [14] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European Conference on Computer Vision, 2018.
  • [15] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. arXiv preprint arXiv:1905.05754, 2019.
  • [16] Yasamin Jafarian, Yuan Yao, and Hyun Soo Park. MONET: Multiview semi-supervised keypoint detection via epipolar divergence. arXiv preprint arXiv:1806.00104, 2018.
  • [17] Abdolrahim Kadkhodamohammadi and Nicolas Padoy. A generalizable approach for multi-view 3D human pose regression. arXiv preprint arXiv:1804.10462, 2018.
  • [18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [19] Shile Li and Dongheui Lee. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, 2014.
  • [21] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [22] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [23] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, 2016.
  • [24] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch, 2017.
  • [25] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [26] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [27] Vignesh Prasad, Dipanjan Das, and Brojeshwar Bhowmick. Epipolar geometry based learning of multi-view depth and ego-motion from monocular sequences. arXiv preprint arXiv:1812.11922, 2018.
  • [28] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [29] Helge Rhodin, Jörg Spörri, Isinsu Katircioglu, Victor Constantin, Frédéric Meyer, Erich Müller, Mathieu Salzmann, and Pascal Fua. Learning monocular 3D human pose estimation from multi-view images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [30] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [31] Denis Tome, Matteo Toso, Lourdes Agapito, and Chris Russell. Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In International Conference on 3D Vision (3DV), 2018.
  • [32] Hsiao-Yu Fish Tung, Ricson Cheng, and Katerina Fragkiadaki. Learning spatial common sense with geometry-aware recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  • [34] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [35] Xiaokun Wu, Daniel Finnegan, Eamonn O'Neill, and Yong-Liang Yang. HandMap: Robust hand pose estimation via intermediate dense guidance map supervision. In Proceedings of the European Conference on Computer Vision, 2018.
  • [36] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [37] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, 2018.
  • [38] Guandao Yang, Tomasz Malisiewicz, Serge Belongie, Erez Farhan, Sungsoo Ha, Yuewei Lin, Xiaojing Huang, Hanfei Yan, and Wei Xu. Learning data-adaptive interest points through epipolar adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [39] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [40] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision, 2018.
  • [41] Qi Ye and Tae-Kyun Kim. Occlusion-aware hand pose estimation using hierarchical mixture density network. In Proceedings of the European Conference on Computer Vision, 2018.
  • [42] Shanxin Yuan, Bjorn Stenger, and Tae-Kyun Kim. RGB-based 3D hand pose estimation via privileged learning with depth images. arXiv preprint arXiv:1811.07376, 2018.
  • [43] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [44] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [45] Yidan Zhou, Jian Lu, Kuo Du, Xiangbo Lin, Yi Sun, and Xiaohong Ma. HBE: Hand branch ensemble network for real-time 3D hand pose estimation. In Proceedings of the European Conference on Computer Vision, 2018.
  • [46] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, 2017.