Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

European Conference on Computer Vision (ECCV), pp. 718-734 (2020)


Abstract

Multi-person pose estimation is challenging because it localizes body keypoints for multiple persons simultaneously. Previous methods can be divided into two streams, i.e. top-down and bottom-up methods. The top-down methods localize keypoints after human detection, while the bottom-up methods localize keypoints directly and then cluste...

Introduction
  • Multi-person pose estimation aims at localizing 2d keypoints of an unknown number of people in an image.
  • Previous works generally treat the grouping stage as post-processing by using integer linear programming [18,19,23,35], heuristic greedy parsing [3,33], or clustering [29,31].
  • Associative embedding (AE) [29] produces a permutation-invariant associative embedding for each keypoint, and learns by pushing apart the embeddings of different people and pulling together those of the same instance.
  • Although AE uses the associative embedding, which encodes pairwise relationships, to group keypoints, the grouping procedure itself is still offline, and no direct supervision is applied to the grouping results.
  • Even when the pairwise loss is low, the parsing results can still be wrong, and vice versa.
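The pull/push objective described for AE can be sketched as follows. This is a simplified illustration, not the reference implementation of [29]: it assumes scalar tags, skips self-pairs in the push term, and the function name, the normalisation and the default `sigma` are assumptions made for the example.

```python
import math

def associative_embedding_loss(tags, sigma=1.0):
    """Simplified pull/push loss over scalar keypoint tags.

    tags: list of persons; each person is a list of scalar tag values,
          one per annotated keypoint (hypothetical input layout).
    Returns (pull, push).
    """
    # Reference embedding of a person = mean tag over that person's keypoints.
    refs = [sum(person) / len(person) for person in tags]

    # Pull term: squared distance of each keypoint tag to its person's reference.
    num_keypoints = sum(len(person) for person in tags)
    pull = sum(
        (tag - ref) ** 2 for person, ref in zip(tags, refs) for tag in person
    ) / num_keypoints

    # Push term: penalise reference embeddings of different people being close.
    num_people = len(refs)
    push = 0.0
    for i in range(num_people):
        for j in range(num_people):
            if i != j:
                push += math.exp(-((refs[i] - refs[j]) ** 2) / (2 * sigma ** 2))
    push /= num_people * num_people
    return pull, push
```

Both terms only look at pairwise distances between embeddings; nothing in this loss evaluates the groups that a later, offline parsing step would produce, which is the gap the bullets above point out.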
Highlights
  • Multi-person pose estimation aims at localizing 2d keypoints of an unknown number of people in an image.
  • Associative embedding (AE) [29] produces a permutation-invariant associative embedding for each keypoint, and learns by pushing apart the embeddings of different people and pulling together those of the same instance. Although AE uses the associative embedding, which encodes pairwise relationships, to group keypoints, the grouping procedure itself is still offline, and no direct supervision is applied to the grouping results.
  • We propose the Online Hierarchical Graph Clustering (OHGC) algorithm, which makes the process of grouping keypoints learnable and can be embedded into main-stream bottom-up methods.
  • We see that the proposed Hierarchical Graph Grouping (HGG) model achieves 67.6 Average Precision (AP) overall, which is slightly lower than the state-of-the-art method PersonLab [33].
  • HGG even outperforms the top-down method SBL by 2.7 AP on the test set, which further indicates that our method is robust in more challenging scenarios.
  • The experimental results show that the proposed method outperforms the baseline by a large margin and achieves comparable performance with the state-of-the-art bottom-up pose estimation methods on the COCO dataset.
  • We have shown how we can combine the representative feature learning ability of CNN and the efficient long-range message passing as well as the relational feature learning capability of Graph Neural Networks (GNN).
Methods
  • Overview: the proposed hierarchical graph grouping (HGG) framework is illustrated in Fig. 2.
  • Following AE [29], the authors use a 4-stacked hourglass [30] as the backbone of the keypoint candidate proposal network.
  • The keypoint proposal network provides keypoint candidates and raw relational feature embedding for the keypoints grouping module.
  • In the keypoint grouping stage, the authors build a graph neural network using the candidates and relational features extracted from the former stage.
  • As shown in #7, #8 and #9, both graph-based models perform better than the Ours-FC baseline because of more effective interactive message passing.
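To make the graph view of grouping concrete, the sketch below treats keypoint candidates as nodes and merges any two nodes whose edge score exceeds a threshold. It is a plain, non-differentiable union-find stand-in chosen for illustration, not the HGG module itself; `group_keypoints`, the score dictionary and the threshold are all hypothetical.

```python
def group_keypoints(num_nodes, edge_scores, threshold=0.5):
    """Group nodes by merging every edge whose score exceeds `threshold`.

    num_nodes:   number of keypoint candidates (graph nodes).
    edge_scores: dict mapping (i, j) node pairs to a score in [0, 1],
                 assumed to come from some learned edge classifier.
    Returns a sorted list of groups, each a sorted list of node indices.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in edge_scores.items():
        if score > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two groups

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

For example, `group_keypoints(4, {(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.8})` keeps nodes 0 and 1 together, keeps 2 and 3 together, and leaves the weak edge between 1 and 2 unmerged.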
Results
  • Results on MSCOCO dataset

    Table 1a shows experimental results on the MSCOCO test-dev set. The authors see that the proposed HGG model achieves 67.6 AP overall, which is slightly lower than the state-of-the-art method PersonLab [33].
  • Although the results are lower than theirs in AP50, they are superior in AP75.
  • This further indicates that the method has advantages in scenarios that require high-precision pose estimation.
  • The method achieves 41.8% and 36.0% mAP on the val and test sets, establishing a new state of the art.
  • HGG even outperforms the top-down method SBL by 2.7 AP on the test set, which further indicates that the method is robust in more challenging scenarios.
  • The authors find that the time cost of the grouping module is only a small proportion of the total time cost.
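In COCO-style keypoint evaluation, AP50 and AP75 count a predicted pose as correct when its object keypoint similarity (OKS) with the ground truth reaches 0.50 or 0.75, which is why AP75 is read as the higher-precision measure. The sketch below computes OKS for one pose; the per-keypoint `sigmas` passed in the usage note are made-up values for illustration, not the official COCO constants.

```python
import math

def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt: lists of (x, y) keypoint coordinates.
    visible:  list of visibility flags; only keypoints with flag > 0 count.
    area:     object area (scale squared) of the ground-truth person.
    sigmas:   per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), flag, sigma in zip(pred, gt, visible, sigmas):
        if flag > 0:
            dist_sq = (px - gx) ** 2 + (py - gy) ** 2
            k = 2 * sigma  # COCO uses k_i = 2 * sigma_i
            total += math.exp(-dist_sq / (2 * area * k ** 2))
            count += 1
    return total / count if count else 0.0
```

With `area=100` and a single keypoint that is 5 pixels off, `sigmas=[0.25]` gives an OKS of about 0.61: the pose would count as a match at the 0.50 threshold but not at 0.75.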
Conclusion
  • The authors have reformulated the human pose estimation problem using the graph model and presented a full end-to-end learning framework named HGG.
  • The authors have shown how to combine the representative feature learning ability of CNN with the efficient long-range message passing and relational feature learning capability of GNN.
  • The authors envision that the proposed framework can be applied to other related problems such as multi-object tracking and instance segmentation.
  • The authors expect to see more research in this direction in the near future.
Tables
  • Table 1: (a) Comparisons with both top-down and bottom-up methods on the COCO2017 test-dev dataset. ∗ means using single-person pose refinement. × means using extra segmentation annotation. + means using multi-scale testing. Note that our results are obtained without single-person pose refinement. (b) Comparisons with both top-down and bottom-up methods on the OCHuman dataset. Our results are obtained without single-person pose refinement.
  • Table 2: Ablation study of HGG’s components on the COCO validation dataset. “FinalM” means the final-level macro-node discriminator. “Edge” means the edge discriminator. “IntermM” means the intermediate macro-node discriminator.
Related work
  • 2.1 Multi-person Pose Estimation in Images

    Top-down methods [6,12,16,17,26,28,34,38,46] decompose the multi-person pose estimation task into two sub-tasks: (1) human detection and (2) pose estimation in the region of a single human. First, the person detector predicts a bounding box for every human instance in the image. Second, the box is cropped from the image and resized. Third, single-person pose estimation is applied to predict the keypoints of the cropped person. In addition, some works such as Mask R-CNN [16] crop the features instead of the raw image to boost efficiency. In summary, top-down methods are dominant among state-of-the-art methods, but they often have higher computational overhead, especially when the number of human instances increases, because they need to run the single-person pose estimation once for every instance. Furthermore, because the pose estimation depends on the detection, it is difficult for these methods to recover the pose of an instance that is missing from the detection results.
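The three steps above can be sketched as a short loop. The detector, cropper and pose estimator here are hypothetical toy stand-ins (no real model is involved, and resizing is omitted); the point is only that the pose estimator runs once per detected box, and that a person the detector misses never reaches it.

```python
def top_down_pose(image, detect_people, crop, estimate_single_pose):
    """Generic top-down pipeline; all three callables are hypothetical."""
    poses = []
    for box in detect_people(image):               # step 1: one box per detection
        patch = crop(image, box)                   # step 2: crop the person region
        poses.append(estimate_single_pose(patch))  # step 3: single-person pose
    return poses

# Toy stand-ins so the sketch runs without any model.
calls = {"pose": 0}

def fake_detector(image):
    # Pretend the image holds three people but only two are detected.
    return [(0, 0, 2, 2), (2, 0, 4, 2)]

def fake_crop(image, box):
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def fake_pose(patch):
    calls["pose"] += 1  # count how often the pose network would run
    return {"patch_size": (len(patch), len(patch[0]))}

image = [[0] * 6 for _ in range(2)]
poses = top_down_pose(image, fake_detector, fake_crop, fake_pose)
```

After this runs, `calls["pose"]` equals the number of detected boxes, so the cost of the pose stage grows linearly with the number of people found, and the undetected third person has no pose at all.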
Funding
  • This work is partially supported by the SenseTime Donation for Research, HKU Seed Fund for Basic Research, Startup Fund, General Research Fund No.27208720, the Australian Research Council Grant DP200103223 and Australian Medical Research Future Fund MRFAI000085.
References
  • Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: International Conference on Machine Learning (ICML) (2009)
  • Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
  • Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  • Chen, X., Yuille, A.L.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)
  • Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • Chen, Y., Rohrbach, M., Yan, Z., Shuicheng, Y., Feng, J., Kalantidis, Y.: Graph-based global reasoning networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Chu, X., Ouyang, W., Wang, X., et al.: Crf-cnn: Modeling structured information in human pose estimation. In: Advances in Neural Information Processing Systems (NeurIPS) (2016)
  • Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2007)
  • Doering, A., Iqbal, U., Gall, J.: Joint flow: Temporal flow fields for multi person tracking. arXiv preprint arXiv:1805.04596 (2018)
  • Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems (NeurIPS) (2015)
  • Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: Rmpe: Regional multi-person pose estimation. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
  • Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. International Journal of Computer Vision (IJCV) (2005)
  • Fischler, M.A., Elschlager, R.A.: The representation and matching of pictorial structures. IEEE Transactions on Computers (1973)
  • Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: IEEE International Joint Conference on Neural Networks (IJCNN) (2005)
  • He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. arXiv preprint arXiv:1703.06870 (2017)
  • Huang, S., Gong, M., Tao, D.: A coarse-fine network for keypoint localization. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
  • Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B., Campus, S.I.: Arttrack: Articulated multi-person tracking in the wild. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In: European Conference on Computer Vision (ECCV) (2016)
  • Iqbal, U., Milan, A., Andriluka, M., Ensafutdinov, E., Pishchulin, L., Gall, J., B., S.: PoseTrack: A benchmark for human pose estimation and tracking. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • Iqbal, U., Milan, A., Gall, J.: Pose-track: Joint multi-person pose estimation and tracking. arXiv preprint arXiv:1611.07727 (2016)
  • Jin, S., Liu, W., Ouyang, W., Qian, C.: Multi-person articulated tracking with spatial and temporal embeddings. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Jin, S., Ma, X., Han, Z., Wu, Y., Yang, W., Liu, W., Qian, C., Ouyang, W.: Towards multi-person pose tracking: Bottom-up and top-down methods. In: ICCV PoseTrack Workshop (2017)
  • Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: BMVC (2010)
  • Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  • Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)
  • Liu, W., Chen, J., Li, C., Qian, C., Chu, X., Hu, X.: A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. In: The Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  • Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision (ECCV) (2016)
  • Nie, X., Feng, J., Xing, J., Yan, S.: Generative partition networks for multi-person pose estimation. arXiv preprint arXiv:1705.07422 (2017)
  • Nie, X., Feng, J., Zhang, J., Yan, S.: Single-stage multi-person pose machines. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
  • Papandreou, G., Zhu, T., Chen, L.C., Gidaris, S., Tompson, J., Murphy, K.: Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. arXiv preprint arXiv:1803.08225 (2018)
  • Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. arXiv preprint arXiv:1701.01779 (2017)
  • Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Transactions on Neural Networks (TNN) (2008)
  • Song, J., Andres, B., Black, M.J., Hilliges, O., Tang, S.: End-to-end learning for graph decomposition. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
  • Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212 (2019)
  • Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
  • Tian, Z., Chen, H., Shen, C.: Directpose: Direct end-to-end multi-person pose estimation. arXiv preprint arXiv:1911.07451 (2019)
  • Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
  • Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)
  • Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. International Conference on Learning Representations (ICLR) (2018)
  • Wang, J., Peng, Z., Lv, P., Sun, J., Zhou, B., Xu, M.: Bi-directional graph structure information model for multi-person pose estimation. arXiv preprint arXiv:1805.00603 (2018)
  • Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) (2019)
  • Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: European Conference on Computer Vision (ECCV) (2018)
  • Xie, E., Sun, P., Song, X., Wang, W., Liu, X., Liang, D., Shen, C., Luo, P.: Polarmask: Single shot instance segmentation with polar representation. arXiv preprint arXiv:1909.13226 (2019)
  • Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (AAAI) (2018)
  • Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2012)
  • Zhang, H., Ouyang, H., Liu, S., Qi, X., Shen, X., Yang, R., Jia, J.: Human pose estimation with spatial contextual information. arXiv preprint arXiv:1901.01760 (2019)
  • Zhang, S.H., Li, R., Dong, X., Rosin, P., Cai, Z., Han, X., Yang, D., Huang, H., Hu, S.M.: Pose2seg: detection free human instance segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Zhou, X., Wang, D., Krahenbuhl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)