Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos

European Conference on Computer Vision (ECCV), pp. 36-52, 2020.

Abstract:

Video annotation is expensive and time-consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have sparser annotations compared to large-scale image datasets for human pose estimation. This makes it challenging to learn deep-learning-based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions as they occur in videos.

Introduction
  • Human pose estimation is a very active research field in computer vision that is relevant for many applications like computer games, security, sports, and autonomous driving.
  • Recently proposed video datasets [1] are less diverse and are sparsely annotated as compared to large scale image datasets for human pose estimation [26,40]
  • This makes it challenging to learn deep networks for associating human keypoints across frames that are robust to nuisance factors such as motion blur, fast motions, and occlusions as they occur in videos
Highlights
  • Human pose estimation is a very active research field in computer vision that is relevant for many applications like computer games, security, sports, and autonomous driving
  • We propose an approach for multi-frame pose estimation and multi-person pose tracking that relies on self-supervised keypoint correspondences which are learned from a large-scale image dataset
  • We evaluate multi-frame pose estimation and tracking results using the mAP and MOTA evaluation metrics
  • We provide implementation details for our top-down pose estimation and keypoint correspondences framework below
  • We have proposed a self-supervised keypoint correspondence framework for the tasks of multi-frame pose estimation and multi-person pose tracking
  • The proposed keypoint correspondences tracking approach outperforms the state-of-the-art for the tasks of multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets
Methods
  • Table 1 compares the methods in terms of detection improvement and tracking strategy: our approach uses keypoint correspondences for both, while HRNet [37] and MDPN [13] track with optical flow, POINet [35] with an Ovonic Insight Net, LightTrack [32] with a GCN, ProTracker [12] with IoU, STAF [34] with spatio-temporal affinity fields, STEmbeddings [20] with spatio-temporal embeddings, and JointFlow [10] with temporal flow fields.
    Multi-person pose tracking: Recent success in multi-person pose estimation in still images has led researchers to work on the challenging problem of multi-person pose tracking.
  • The authors propose a multi-person pose tracking framework that is robust to motion blur and severe occlusions and does not need any video data for training.
  • The results show that correspondence-based tracking (1) achieves consistent improvements over the baselines with GT and detected boxes, with MOTA scores of 70.5 and 67.9, respectively, and (2) significantly reduces the number of identity switches.
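The correspondence-based association step can be sketched as a greedy assignment of detected poses to tracks by pose similarity. This is a minimal illustration only; the OKS-style similarity, the threshold value, and all function names are assumptions, not the authors' exact implementation:

```python
import numpy as np

def pose_similarity(pose_a, pose_b, sigma=10.0):
    """OKS-style similarity between two poses given as (K, 2) keypoint arrays."""
    d2 = np.sum((pose_a - pose_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))

def assign_tracks(warped_tracks, detections, sim_thresh=0.3):
    """Greedily assign detected poses to track ids.

    warped_tracks: dict track_id -> (K, 2) pose warped into the current frame.
    detections:    list of (K, 2) poses detected in the current frame.
    Returns one track id per detection (new ids for unmatched detections).
    """
    assignments = [None] * len(detections)
    # Score all track/detection pairs, then greedily take the best remaining pair.
    pairs = [(pose_similarity(tp, dp), tid, di)
             for tid, tp in warped_tracks.items()
             for di, dp in enumerate(detections)]
    used_tracks = set()
    for sim, tid, di in sorted(pairs, reverse=True):
        if sim < sim_thresh:
            break  # remaining pairs are too dissimilar to match
        if tid in used_tracks or assignments[di] is not None:
            continue
        assignments[di] = tid
        used_tracks.add(tid)
    # Unmatched detections start new tracks.
    next_id = max(warped_tracks.keys(), default=-1) + 1
    for di in range(len(detections)):
        if assignments[di] is None:
            assignments[di] = next_id
            next_id += 1
    return assignments
```

In the paper's setting, the warped poses would come from keypoint correspondences rather than optical flow, which is what makes the association robust to motion blur.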
Results
  • The authors evaluate the approach on the PoseTrack 2017 and PoseTrack 2018 datasets [1]. The datasets have 292 and 593 videos for training and 214 and 375 videos for evaluation, respectively.
  • The authors use a top down framework for frame level pose estimation.
  • The authors extract crops of size 384×288 around detected people as input to the pose estimation framework.
  • Each stage uses GoogLeNet [38] as a backbone followed by a pose decoder.
  • The middle column shows poses warped into frame f + 1 using optical flow.
  • The right column shows poses warped into frame f + 1 using keypoint correspondences.
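The crop extraction around each detection can be sketched as follows. The paper only specifies the 384×288 input size, so the aspect-ratio expansion policy shown here is an assumption based on common top-down pipelines:

```python
def expand_to_aspect(bbox, target_h=384, target_w=288):
    """Expand a (x, y, w, h) person box to the 384x288 (h:w = 4:3) input
    aspect ratio by growing the shorter side around the box center."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    aspect = target_w / target_h  # width / height = 0.75
    if w / h > aspect:
        h = w / aspect  # box too wide -> grow height
    else:
        w = h * aspect  # box too tall -> grow width
    return (cx - w / 2.0, cy - h / 2.0, w, h)
```

The expanded region would then be cropped from the frame and resized to 288×384 pixels before being fed to the pose network; expanding before resizing avoids distorting the person's proportions.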
Conclusion
  • The authors have proposed a self-supervised keypoint correspondence framework for the tasks of multi-frame pose estimation and multi-person pose tracking.
  • The proposed keypoint correspondence framework solves two tasks: (1) recovering missed detections and (2) associating human poses across video frames for the task of multi-person pose tracking.
  • The proposed keypoint correspondences tracking approach outperforms the state-of-the-art for the tasks of multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.
  • As future work, the authors plan to investigate a unified framework that simultaneously performs keypoint correspondence estimation and pose estimation.
Summary
  • Objectives:

    The authors' goal is to utilize keypoint correspondences to recover missed poses of a top-down human pose estimator, e.g. due to partial occlusion, and to utilize keypoint correspondences for multi-person tracking.
  • Given two images I1 and I2 with keypoints {j_p}_{1:N_p} for all persons p in image I1, the goal is to find the corresponding keypoints in I2.
  • The authors' goal is then to assign the pose instances {B_p^f} = {J_p^f} ∪ {J̃_p^f} in frame f, i.e. the union of detected and recovered poses, for persons p ∈ {1, ..., P_f} to tracks.
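Conceptually, finding a corresponding keypoint amounts to a nearest-neighbor search in a learned feature space. The following is a simplified stand-in for the learned correspondence network, using a cosine-similarity argmax over a dense feature map; all names are illustrative assumptions:

```python
import numpy as np

def find_correspondence(query_desc, feat_map):
    """Locate the position in feat_map (H, W, C) whose feature vector is most
    similar (cosine similarity) to query_desc (C,). Returns (row, col)."""
    H, W, C = feat_map.shape
    flat = feat_map.reshape(-1, C)
    # L2-normalize so the dot product equals cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    q = query_desc / (np.linalg.norm(query_desc) + 1e-8)
    idx = int(np.argmax(flat @ q))
    return divmod(idx, W)  # flat index -> (row, col)
```

In practice the query descriptor would be sampled from the feature map of I1 at a detected keypoint of person p, and the argmax over I2's feature map gives that keypoint's correspondence.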
Tables
  • Table 1: Overview of related works on multi-person pose tracking and their respective contributions
  • Table 2: MOTA and identity switch (IDSW) comparison with baselines on the PoseTrack 2017 validation set. The comparison is performed using the same set of pose detections obtained with ground-truth and detected boxes. Correspondence-based tracking consistently improves MOTA over all the baselines and significantly reduces the number of identity switches
  • Table 3: Effect of joint detection thresholds and missed detections on mAP and MOTA on the PoseTrack 2018 validation set. The results are shown for (i) detected poses only and (ii) detected and recovered poses. As expected, recovering missed detections improves both MOTA and mAP. A good trade-off between mAP and MOTA is achieved at a joint detection threshold of 0.3
  • Table 4: Comparison to the state of the art on the PoseTrack 2017 and 2018 validation sets for multi-frame pose estimation
  • Table 5: Comparison to the state of the art on the PoseTrack 17/18 validation and test sets. Approaches marked with + use additional external training data. Approaches marked with ∗ do not report results on the official test set
  • Table 6: Impact of τcorr on mAP and MOTA during tracking
  • Table 7: Comparison of mAP and MOTA for different design choices on the PoseTrack 2017 validation set
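MOTA, reported in several of these tables, penalizes misses, false positives, and identity switches relative to the number of ground-truth objects over all frames; a minimal sketch of the metric:

```python
def mota(num_misses, num_false_positives, num_id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy: 1 minus the total error rate,
    normalized by the number of ground-truth objects summed over frames."""
    errors = num_misses + num_false_positives + num_id_switches
    return 1.0 - errors / num_gt_objects
```

Because identity switches enter the error sum directly, the reduced IDSW count reported in Table 2 translates into a direct MOTA improvement.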
Related work
  • Multi-person pose estimation is an actively researched area. It can be categorized into top-down and bottom-up approaches, where the former are superior to bottom-up methods as shown on the MS-COCO benchmark [26]. In recent years, researchers have tackled the problem of multi-person pose estimation and tracking on video datasets such as PoseTrack [1]. This task comes with a set of additional challenges.

    Multi-person pose estimation: Bottom-up methods [22,7,29,15,31] first detect all person keypoints simultaneously and then associate body parts with their corresponding person instances. [7] is one of the most popular works; it predicts part affinity fields (PAFs) that preserve the location and orientation information of limbs. These PAFs are used with a greedy part association algorithm. More recently, [22] proposed to detect bounding boxes and pose keypoints within the same neural network. The bounding box predictions are used to crop from the predicted keypoint heatmaps. As a second stage, the authors propose a pose residual module which regresses the respective keypoint locations of each person instance.
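The PAF-based association of [7] scores each candidate limb by integrating the affinity field along the segment between two keypoint candidates; a simplified sketch, sampling a few discrete points along the line (the exact sampling density and tie-breaking differ in the original):

```python
import numpy as np

def limb_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average dot product between the PAF vectors sampled along the segment
    p1 -> p2 (points given as (x, y)) and the unit vector of that segment.
    paf_x, paf_y: (H, W) arrays holding the x/y components of the PAF."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0  # degenerate limb candidate
    u = v / norm
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = p1 + t * v
        # Dot product of the field vector with the limb direction:
        score += paf_x[int(round(y)), int(round(x))] * u[0] \
               + paf_y[int(round(y)), int(round(x))] * u[1]
    return score / num_samples
```

A high score means the predicted field consistently points along the candidate limb, so the greedy matcher connects keypoint pairs in decreasing order of this score.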
Funding
  • Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets
  • Our approach achieves state-of-the-art results without using any additional training data except for [26] for the proposed correspondence network
  • As shown in Table 7, omitting any of the introduced design choices results in a significant drop in MOTA of at least 1%, and increases the number of identity switches (IDSW)
References
  • Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., Schiele, B.: PoseTrack: A benchmark for human pose estimation and tracking. In: CVPR (2018)
  • Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: New benchmark and state of the art analysis. In: CVPR (2014)
  • Bertasius, G., Feichtenhofer, C., Tran, D., Shi, J., Torresani, L.: Learning temporal pose estimation from sparsely-labeled videos. In: NeurIPS (2019)
  • Bertasius, G., Feichtenhofer, C., Tran, D., Shi, J., Torresani, L.: Learning temporal pose estimation from sparsely-labeled videos. In: NeurIPS (2019)
  • Bertinetto, L., Valmadre, J., Henriques, J., Vedaldi, A., Torr, P.: Fully-convolutional siamese networks for object tracking. In: ECCV (2016)
  • Cai, Z., Vasconcelos, N.: Cascade R-CNN: Delving into high quality object detection. In: CVPR (2018)
  • Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR (2017)
  • Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: CVPR (2018)
  • Choy, C., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: NIPS (2016)
  • Doering, A., Iqbal, U., Gall, J.: Joint Flow: Temporal flow fields for multi person tracking. In: BMVC (2018)
  • Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. IJCV (2005)
  • Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M., Tran, D.: Detect-and-Track: Efficient pose estimation in videos. In: CVPR (2018)
  • Guo, H., Tang, T., Luo, G., Chen, R., Lu, Y., Wen, L.: Multi-domain pose network for multi-person pose estimation and tracking. In: CVPR (2018)
  • Han, K., Rezende, R.S., Ham, B., Wong, K., Cho, M., Schmid, C., Ponce, J.: SCNet: Learning semantic correspondence. In: ICCV (2017)
  • He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV (2017)
  • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • Hwang, J., Lee, J., Park, S., Kwak, N.: Pose estimator and tracker using temporal flow maps for limbs. In: IJCNN (2019)
  • Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B.: ArtTrack: Articulated multi-person tracking in the wild. In: CVPR (2017)
  • Iqbal, U., Milan, A., Gall, J.: PoseTrack: Joint multi-person pose estimation and tracking. In: CVPR (2017)
  • Jin, S., Liu, W., Ouyang, W., Qian, C.: Multi-person articulated tracking with spatial and temporal embeddings. In: CVPR (2019)
  • Kim, S., Min, D., Ham, B., Jeon, S., Lin, S., Sohn, K.: Fully convolutional self-similarity for dense semantic correspondence. In: CVPR (2017)
  • Kocabas, M., Karagoz, S., Akbas, E.: MultiPoseNet: Fast multi-person pose estimation using pose residual network. In: ECCV (2018)
  • Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with siamese region proposal network. In: CVPR (2018)
  • Li, W., Wang, Z., Yin, B., Peng, Q., Du, Y., Xiao, T., Yu, G., Lu, H., Wei, Y., Sun, J.: Rethinking on multi-stage networks for human pose estimation. arXiv preprint (2019)
  • Li, X., Liu, S., Mello, S.D., Wang, X., Kautz, J., Yang, M.H.: Joint-task self-supervised learning for temporal correspondence. In: NeurIPS (2019)
  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.: Microsoft COCO: Common objects in context. In: ECCV (2014)
  • Luo, H., Gu, Y., Liao, X., Lai, S., Jiang, W.: Bag of tricks and a strong baseline for deep person re-identification. In: CVPRW (2019)
  • Moon, G., Chang, J.Y., Lee, K.M.: Multi-scale aggregation R-CNN for 2D multi-person pose estimation. In: CVPRW (2019)
  • Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: NIPS (2017)
  • Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV (2016)
  • Nie, X., Feng, J., Xing, J., Yan, S.: Generative partition networks for multi-person pose estimation. In: ECCV (2018)
  • Ning, G., Huang, H.: LightTrack: A generic framework for online top-down human pose tracking. arXiv preprint (2019)
  • Ning, G., Liu, P., Fan, X., Zhang, C.: A top-down approach to articulated human pose estimation and tracking. arXiv preprint (2019)
  • Raaj, Y., Idrees, H., Hidalgo, G., Sheikh, Y.: Efficient online multi-person 2D pose tracking with recurrent spatio-temporal affinity fields. In: CVPR (2019)
  • Ruan, W., Liu, W., Bao, Q., Chen, J., Cheng, Y., Mei, T.: POINet: Pose-guided ovonic insight network for multi-person pose tracking. In: ACM MM (2019)
  • Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)
  • Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR (2019)
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
  • Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: CVPR (2019)
  • Wu, J., Zheng, H., Zhao, B., Li, Y., Yan, B., Liang, R., Wang, W., Zhou, S., Lin, G., Fu, Y., Wang, Y., Wang, Y.: AI Challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint (2017)
  • Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV (2018)
  • Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose Flow: Efficient online pose tracking. In: BMVC (2018)
  • Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: ICCV (2017)
  • Yu, D., Su, K., Sun, J., Wang, C.: Multi-person pose estimation for pose tracking with enhanced cascaded pyramid network. In: ECCV Workshops (2018)
  • Zhang, R., Zhu, Z., Li, P., Wu, R., Guo, C., Huang, G., Xia, H.: Exploiting offset-guided network for pose estimation and tracking. In: CVPR (2019)
  • Zhang, Z., Peng, H., Wang, Q.: Deeper and wider siamese networks for real-time visual tracking. In: CVPR (2019)
  • Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware siamese networks for visual object tracking. In: ECCV (2018)