Hierarchical Human Parsing with Typed Part-Relation Reasoning

CVPR, pp. 8926-8936, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00895

Abstract:

Human parsing aims at pixel-wise semantic understanding of the human body. As human bodies are inherently hierarchically structured, how to model human structures is the central theme in this task. Focusing on this, we seek to simultaneously exploit the representational capacity of deep graph networks and the hierarchical human structures. In partic…

Introduction
  • Human parsing involves segmenting human bodies into semantic parts, e.g., head, arm, leg, etc.
  • It has attracted tremendous attention in the literature, as it enables fine-grained human understanding and finds a wide spectrum of human-centric applications, such as human behavior analysis [50, 14], human-robot interaction [16], and many others.
Highlights
  • Human parsing involves segmenting human bodies into semantic parts, e.g., head, arm, leg, etc.
  • To respond to the above challenges and enable a deeper understanding of human structures, we develop a unified, structured human parser that precisely describes a more complete set of part relations and efficiently reasons over structures through the prism of a message-passing, feedback inference scheme
  • In contrast to conventional message-passing graph networks, which are mainly multilayer-perceptron-based and edge-type-agnostic, we provide a spatial-information-preserving, relation-type-aware graph learning scheme
  • This work proposes a hierarchical human parser that addresses structure modeling in two aspects
  • Three distinct relation networks are designed to precisely describe the compositional/decompositional relations between constituent and entire parts and to aid dependency learning over kinematically connected parts
  • To address inference over the loopy human structure, our parser relies on a convolutional, message-passing-based approximation algorithm, which enjoys the advantages of iterative optimization and spatial information preservation
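As a concrete illustration, typed message passing over a loopy part hierarchy can be sketched as follows. The hierarchy, the relation types, the feature shapes, and the 1×1 channel-mixing weights are all illustrative assumptions, not the paper's actual architecture; the point of the sketch is that each edge type gets its own transform and that iterative updates preserve the spatial layout of the features.

```python
# Hypothetical sketch of relation-type-aware, convolutional message passing
# over a small human-part hierarchy (body -> upper/lower -> head/arm/leg).
# All names and shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Node feature maps: (C, H, W) -- the spatial layout is preserved throughout.
C, H, W = 4, 8, 8
nodes = {n: rng.standard_normal((C, H, W))
         for n in ["body", "upper", "lower", "head", "arm", "leg"]}

# Typed edges: (src, dst, relation type). The loops come from combining
# top-down (decomposition) and bottom-up (composition) edges.
edges = [("body", "upper", "decomp"), ("body", "lower", "decomp"),
         ("upper", "head", "decomp"), ("upper", "arm", "decomp"),
         ("lower", "leg", "decomp"),
         ("head", "upper", "comp"), ("arm", "upper", "comp"),
         ("leg", "lower", "comp"),
         ("arm", "leg", "dep")]  # a kinematic dependency edge

# One channel-mixing matrix per relation type (a real model would use
# spatial convolutions; 1x1 mixing keeps the sketch short).
W_rel = {t: rng.standard_normal((C, C)) * 0.1 for t in ["decomp", "comp", "dep"]}

def step(nodes):
    """One round of message passing: aggregate typed messages, then update."""
    msgs = {n: np.zeros((C, H, W)) for n in nodes}
    for src, dst, rel in edges:
        # einsum applies the type-specific channel mixing at every pixel.
        msgs[dst] += np.einsum("oc,chw->ohw", W_rel[rel], nodes[src])
    # Residual update with ReLU; spatial resolution is unchanged.
    return {n: np.maximum(nodes[n] + msgs[n], 0.0) for n in nodes}

for _ in range(3):          # iterative (approximate) inference on the loopy graph
    nodes = step(nodes)

print(nodes["head"].shape)  # spatial size preserved: (4, 8, 8)
```

Running several rounds lets information from, e.g., the leg node influence the arm node through the dependency and composition edges, which is the iterative-optimization behavior the bullet above refers to.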
Methods
  • Though recent human parsers achieve impressive results, the proposed model still outperforms all competitors by a large margin.
  • In terms of pixel accuracy, mean accuracy, and mean IoU, the parser surpasses the best-performing method, CNIF [60], by 1.02%, 1.78% and 1.51%, respectively.
  • The evaluation results demonstrate that the human parser achieves 65.3% mIoU, with substantial gains over the second best, CNIF [60], and third best, LCPC [9], of 4.8% and 11.8%, respectively.
Results
  • The authors follow the official evaluation protocols of each dataset. For LIP, following [71], they report pixel accuracy, mean accuracy, and mean Intersection-over-Union (mIoU).
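For reference, the three LIP metrics quoted above can all be derived from a confusion matrix, as in the sketch below. The labels are toy data; this is not the benchmark's official evaluation code.

```python
# Minimal sketch of the three LIP metrics -- pixel accuracy, mean (per-class)
# accuracy, and mean IoU -- computed from a confusion matrix. Toy labels only.
import numpy as np

def confusion(gt, pred, num_classes):
    """num_classes x num_classes confusion matrix from flat label arrays."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def metrics(gt, pred, num_classes):
    cm = confusion(gt, pred, num_classes)
    tp = np.diag(cm)
    pixel_acc = tp.sum() / cm.sum()
    mean_acc = np.mean(tp / np.maximum(cm.sum(axis=1), 1))            # per-class recall
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)    # TP / (TP+FP+FN)
    return pixel_acc, mean_acc, iou.mean()

gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 0, 2, 2])
pa, ma, miou = metrics(gt, pred, num_classes=3)
print(round(pa, 3), round(ma, 3), round(miou, 3))  # -> 0.833 0.833 0.722
```

Note that mean accuracy averages per-class recall, while mIoU also penalizes false positives, which is why the two can diverge even when pixel accuracy is high.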
Conclusion
  • In the human semantic parsing task, structure modeling is an essential, albeit inherently difficult, avenue to explore.
  • This work proposed a hierarchical human parser that addresses this issue in two aspects.
  • Three distinct relation networks are designed to precisely describe the compositional/decompositional relations between constituent and entire parts and to aid dependency learning over kinematically connected parts.
  • To address inference over the loopy human structure, the parser relies on a convolutional, message-passing-based approximation algorithm, which enjoys the advantages of iterative optimization and spatial information preservation.
  • The above designs enable strong performance across five widely adopted benchmark datasets, at times outperforming all other competitors by a large margin.
Tables
  • Table 1: Comparison of pixel accuracy, mean accuracy and mIoU on LIP val [22]. † indicates extra pose information used
  • Table 2: Per-class comparison of mIoU on PASCAL-Person-Part test
  • Table 3: Comparison of accuracy, foreground accuracy, average precision, recall and F1-score on ATR test [31]
  • Table 4: Comparison of pixel accuracy, foreground pixel accuracy, average precision, average recall and average F1-score on Fashion Clothing test [45]
  • Table 5: Comparison of mIoU on PPSS test [44]
  • Table 6: Ablation study (§4.3) on PASCAL-Person-Part test
Related work
  • Human parsing: Over the past decade, active research has been devoted to pixel-level human semantic understanding. Early approaches tended to leverage image regions [35, 67, 68], hand-crafted features [57, 7], part templates [2, 11, 10] and human keypoints [66, 35, 67, 68], and typically explored certain heuristics over human body configurations [3, 11, 10] in a CRF [66, 28], structured model [67, 11], grammar model [3, 42, 10], or generative model [13, 51] framework. Recent advances have been driven by the streamlined designs of deep learning architectures. Some pioneering efforts revisit the classic template-matching strategy [31, 36], address local and global cues [34], or use tree-LSTMs to gather structure information [32, 33]. However, due to their use of superpixels [34, 32, 33] or HOG features [44], these methods are fragmented and time-consuming. Subsequent attempts thus follow a more elegant FCN architecture, addressing multi-level cues [5, 62], feature aggregation [45, 71, 38], adversarial learning [70, 46, 37], or cross-domain knowledge [37, 65, 20]. To further explore inherent structures, numerous approaches [64, 71, 22, 63, 15, 47] choose to directly encode pose information into the parsers, though they rely on off-the-shelf pose estimators [18, 17] or additional annotations. Others consider top-down [73] or multi-source semantic [60] information over hierarchical human layouts. Though impressive, these methods ignore iterative inference and seldom address explicit relation modeling, leaving them prone to weak expressive ability and sub-optimal results.
Funding
  • Develops a unified, structured human parser that precisely describes a more complete set of part relations and efficiently reasons over structures through the prism of a message-passing, feedback inference scheme
  • Evaluates the approach on five standard human parsing datasets, achieving state-of-the-art performance on all of them
  • Provides a spatial-information-preserving, relation-type-aware graph learning scheme
References
  • Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 39(12):2481–2495, 2017.
  • Yihang Bo and Charless C Fowlkes. Shape-based pedestrian parsing. In CVPR, 2011.
  • Hong Chen, Zi Jian Xu, Zi Qiang Liu, and Song Chun Zhu. Composite templates for cloth modeling and sketching. In CVPR, 2006.
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 40(4):834–848, 2018.
  • Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
  • Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
  • Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, and Honghui Shi. SPGNet: Semantic prediction guidance for scene parsing. In ICCV, 2019.
  • Kang Dang and Junsong Yuan. Location constrained pixel classifiers for image parsing with regular spatial layout. In BMVC, 2014.
  • Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Towards unified human parsing and pose estimation. In CVPR, 2014.
  • Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, and Shuicheng Yan. A deformable mixture parsing model with parselets. In ICCV, 2013.
  • David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
  • S Eslami and Christopher Williams. A generative model for parts-based object segmentation. In NIPS, 2012.
  • Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In ICCV, 2019.
  • Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, and Cewu Lu. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In CVPR, 2018.
  • Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet: A large-scale clustered and densely annotated dataset for object grasping. arXiv preprint arXiv:1912.13470, 2019.
  • Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
  • Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, 2018.
  • Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
  • Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, and Liang Lin. Graphonomy: Universal human parsing via graph transfer learning. In CVPR, 2019.
  • Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. Instance-level human parsing via part grouping network. In ECCV, 2018.
  • Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017.
  • William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016.
  • Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Lubor Ladicky, Philip HS Torr, and Andrew Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
  • Qizhu Li, Anurag Arnab, and Philip HS Torr. Holistic, instance-level human parsing. In BMVC, 2017.
  • Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving LSTM. In CVPR, 2017.
  • Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, and Shuicheng Yan. Deep human parsing with active template regression. IEEE TPAMI, 37(12):2402–2414, 2015.
  • Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph LSTM. In ECCV, 2016.
  • Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with local-global long short-term memory. In CVPR, 2016.
  • Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan. Human parsing with contextualized convolutional neural network. In ICCV, 2015.
  • Si Liu, Jiashi Feng, Csaba Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. Fashion parsing with weak color-category labels. TMM, 16(1):253–265, 2014.
  • Si Liu, Xiaodan Liang, Luoqi Liu, Xiaohui Shen, Jianchao Yang, Changsheng Xu, Liang Lin, Xiaochun Cao, and Shuicheng Yan. Matching-CNN meets KNN: Quasi-parametric human parsing. In CVPR, 2015.
  • Si Liu, Yao Sun, Defa Zhu, Guanghui Ren, Yu Chen, Jiashi Feng, and Jizhong Han. Cross-domain human parsing via adversarial feature and label adaptation. In AAAI, 2018.
  • Si Liu, Changhu Wang, Ruihe Qian, Han Yu, Renda Bao, and Yao Sun. Surveillance video parsing with single frame supervision. In CVPR, 2017.
  • Ting Liu, Tao Ruan, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. Devil in the details: Towards accurate single and multiple human parsing. arXiv preprint arXiv:1809.05996, 2018.
  • Xinchen Liu, Meng Zhang, Wu Liu, Jingkuan Song, and Tao Mei. BraidNet: Braiding semantics and details for accurate human parsing. In ACMMM, 2019.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • Long Zhu, Yuanhao Chen, Yifei Lu, Chenxi Lin, and A. Yuille. Max margin and/or graph learning for parsing the human body. In CVPR, 2008.
  • Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. In NIPS-workshop, 2016.
  • Ping Luo, Xiaogang Wang, and Xiaoou Tang. Pedestrian parsing via deep decompositional network. In ICCV, 2013.
  • Xianghui Luo, Zhuo Su, Jiaming Guo, Gengwei Zhang, and Xiangjian He. Trusted guidance pyramid network for human parsing. In ACMMM, 2018.
  • Yawei Luo, Zhedong Zheng, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Macro-micro adversarial network for human parsing. In ECCV, 2018.
  • Xuecheng Nie, Jiashi Feng, and Shuicheng Yan. Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, 2018.
  • Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
  • Seyoung Park, Bruce Xiaohan Nie, and Song-Chun Zhu. Attribute and-or grammar for joint parsing of human pose, parts and attributes. IEEE TPAMI, 40(7):1555–1569, 2018.
  • Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
  • Ingmar Rauschert and Robert T Collins. A generative model for simultaneous estimation of human body shape and pixel-level segmentation. In ECCV, 2012.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
  • Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • Nan Wang and Haizhou Ai. Who blocks who: Simultaneous clothing segmentation for grouping images. In ICCV, 2011.
  • Wenguan Wang, Xiankai Lu, Jianbing Shen, David J Crandall, and Ling Shao. Zero-shot video object segmentation via attentive graph neural networks. In ICCV, 2019.
  • Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR, 2018.
  • Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning compositional neural information fusion for human parsing. In ICCV, 2019.
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • Fangting Xia, Peng Wang, Liang-Chieh Chen, and Alan L Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016.
  • Fangting Xia, Peng Wang, Xianjie Chen, and Alan L Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.
  • Fangting Xia, Jun Zhu, Peng Wang, and Alan L Yuille. Pose-guided human parsing by an and/or graph using pose-context features. In AAAI, 2016.
  • Wenqiang Xu, Yonglu Li, and Cewu Lu. SRDA: Generating instance segmentation annotation via scanning, reasoning and domain adaptation. In ECCV, 2018.
  • Kota Yamaguchi, M Hadi Kiapour, and Tamara L Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
  • Kota Yamaguchi, M Hadi Kiapour, Luis E Ortiz, and Tamara L Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
  • Wei Yang, Ping Luo, and Liang Lin. Clothing co-parsing by joint image segmentation and labeling. In CVPR, 2014.
  • Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • Jian Zhao, Jianshu Li, Yu Cheng, Terence Sim, Shuicheng Yan, and Jiashi Feng. Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACMMM, 2018.
  • Jian Zhao, Jianshu Li, Xuecheng Nie, Fang Zhao, Yunpeng Chen, Zhecan Wang, Jiashi Feng, and Shuicheng Yan. Self-supervised neural aggregation networks for human parsing. In CVPR-workshop, 2017.
  • Zilong Zheng, Wenguan Wang, Siyuan Qi, and Song-Chun Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR, 2019.
  • Bingke Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Progressive cognitive human parsing. In AAAI, 2018.