BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation


Abstract:

Generating human action proposals in untrimmed videos is an important yet challenging task with wide applications. Current methods often suffer from the noisy boundary locations and the inferior quality of confidence scores used for proposal retrieving. In this paper, we present BSN++, a new framework which exploits complementary boundary regressors and relation modeling for temporal proposal generation.

Introduction
  • The temporal action detection task, which requires categorizing real-world untrimmed videos and locating the temporal boundaries of action instances, has received much attention in recent years.
  • Though BSN achieves convincing performance, it still suffers from three main drawbacks: (1) BSN only employs the local details around the boundaries to predict them, without taking advantage of the rich temporal contexts throughout the whole video sequence; (2) BSN fails to consider proposal-proposal relations for confidence evaluation; (3) the imbalanced data distribution over positive/negative proposals and temporal durations is neglected.
  • To alleviate these issues, the authors propose BSN++ for temporal proposal generation.
  • The main contributions of the work are threefold:
Highlights
  • The temporal action detection task, which requires categorizing real-world untrimmed videos and locating the temporal boundaries of action instances, has received much attention in recent years
  • The other type of methods (Lin et al. 2018; Xiong et al. 2017; Lin et al. 2019) has recently attracted many researchers; these approaches tackle the problem in a bottom-up fashion, where the input video is evaluated at a finer level. (Lin et al. 2018) is a typical method of this type, proposing the Boundary-Sensitive Network (BSN) to generate proposals with flexible durations and reliable confidence scores
  • We propose the complementary boundary regressor, where the starting classifier can be used to predict ending locations when the input video is processed in the reverse direction, and vice versa
  • We propose BSN++, which differs from previous works in three main aspects: (1) we revisit the boundary prediction task and propose a complementary boundary generator that exploits rich contexts together with a bi-directional matching strategy for boundary prediction; (2) we propose a proposal relation block for modeling proposal-proposal relations; (3) a two-stage re-sampling scheme is designed for balanced training
  • We propose BSN++ for temporal action proposal generation
  • Further combined with existing action classifiers, our method achieves state-of-the-art temporal action detection performance
  • The complementary boundary generator takes advantage of the U-shaped architecture and the bi-directional boundary matching mechanism to learn rich contexts for boundary prediction (a toy illustration of the bi-directional matching idea is sketched below)
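The following is a minimal, hypothetical sketch of the bi-directional boundary idea described above, not the authors' released implementation: a stand-in starting-boundary head applied to the temporally reversed feature sequence is reused to predict endings (and vice versa), and the two directions are fused by plain averaging. Layer sizes, feature dimensions, and the fusion rule are placeholders.

```python
# A minimal, hypothetical sketch of bi-directional boundary prediction (not the
# authors' implementation): a starting-boundary head applied to the temporally
# reversed feature sequence is reused to predict endings, and vice versa.
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    """Stand-in per-frame boundary classifier; layer sizes are placeholders."""
    def __init__(self, feat_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(256, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, feat_dim, T)
        return self.net(x).squeeze(1)      # (batch, T) boundary probabilities

start_head, end_head = BoundaryHead(), BoundaryHead()

features = torch.randn(2, 400, 100)              # (batch, feat_dim, T) snippet features
reversed_feats = torch.flip(features, dims=[2])  # process the video backwards

# Forward pass: each head predicts its own boundary type.
start_fwd, end_fwd = start_head(features), end_head(features)

# Backward pass: on the reversed sequence an ending behaves like a starting
# boundary, so the heads swap roles; flip the outputs back to the original order.
start_bwd = torch.flip(end_head(reversed_feats), dims=[1])
end_bwd = torch.flip(start_head(reversed_feats), dims=[1])

# Fuse the complementary predictions (plain averaging here).
start_prob = 0.5 * (start_fwd + start_bwd)
end_prob = 0.5 * (end_fwd + end_bwd)
```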
Methods
  • The reported comparisons cover proposal quality on ActivityNet-1.3 (AR@1, AR@100 and AUC against SSAD-prop, CTAP, BSN, MGG and BMN), detection mAP@tIoU on the ActivityNet-1.3 validation set (against SSN, SSAD, BSN and BMN), and detection mAP@tIoU on the THUMOS14 testing set (against TURN, BSN, MGG and BMN, all paired with the UntrimmedNet (UNet) video-level classifier).
  • To further examine the quality of the proposals generated by BSN++, following BSN (Lin et al. 2018), the authors feed them to state-of-the-art action classifiers to obtain categories for action detection in a “detection by classification” framework (a simplified sketch of this fusion step follows this list).
  • On ActivityNet-1.3, the authors use the top-1 video-level classification results generated by (Xiong et al. 2016) for all the generated proposals.
  • On THUMOS14, the authors use the top-2 video-level classification results generated by UntrimmedNet (Wang et al. 2017).
  • With the same classifiers, the detection performance is boosted greatly, which further demonstrates the effectiveness and superiority of the method
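As a concrete illustration of this “detection by classification” step, here is a minimal Python sketch that pairs class-agnostic proposals with the video-level top-k classes and multiplies the two confidences; the multiplicative scoring rule and all function and variable names are assumptions, not the paper's exact procedure.

```python
# A simplified sketch of the "detection by classification" fusion; the multiplicative
# scoring rule and function names are assumptions, not the paper's exact procedure.
def proposals_to_detections(proposals, video_class_scores, top_k=2):
    """proposals: list of (t_start, t_end, confidence) for one video;
    video_class_scores: dict {class_name: score} from a video-level classifier."""
    top_classes = sorted(video_class_scores.items(),
                         key=lambda kv: kv[1], reverse=True)[:top_k]
    detections = []
    for t_start, t_end, conf in proposals:
        for cls, cls_score in top_classes:
            detections.append({
                "label": cls,
                "segment": [t_start, t_end],
                "score": conf * cls_score,   # fused detection confidence
            })
    return detections

# Example: top-2 video-level classes, as done on THUMOS14 with UntrimmedNet scores.
dets = proposals_to_detections(
    proposals=[(12.3, 25.8, 0.91), (40.0, 55.5, 0.47)],
    video_class_scores={"LongJump": 0.82, "HighJump": 0.11, "Diving": 0.02},
    top_k=2,
)
```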
Results
  • Evaluation metrics

    Following convention, Average Recall (AR) is calculated under multiple tIoU thresholds, set to [0.5:0.05:0.95] on ActivityNet-1.3 and [0.5:0.05:1.0] on THUMOS14.
  • The authors also calculate the area under the AR vs. AN curve (AUC) as an evaluation metric on the ActivityNet-1.3 dataset, where AN ranges from 0 to 100 (a toy computation of these metrics is sketched after this list).
  • Comparison with the state of the art.
  • It can be observed that BSN++ outperforms other state-of-the-art proposal generation methods by a large margin in terms of AR@AN and AUC on the validation set of ActivityNet-1.3.
  • For a direct comparison with BSN, BSN++ improves AUC from 66.17% to 68.26% on the validation set.
  • On ActivityNet-1.3, the mAP with tIoU thresholds {0.5, 0.75, 0.95} and the average mAP with tIoU thresholds [0.5:0.05:0.95] are reported.
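To make the metrics concrete, below is a toy, single-video Python sketch of AR@AN and the AUC of the AR-AN curve under the ActivityNet-1.3 tIoU convention. The official evaluation toolkit averages recall over the whole dataset, so this is only an illustration with made-up proposals and ground truths.

```python
# Toy, single-video illustration of AR@AN and AUC with made-up data; the official
# toolkit averages recall over the whole dataset.
import numpy as np

def tiou(p, g):
    """Temporal IoU between two [start, end] segments."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(proposals, gts, n, thresholds):
    """Fraction of ground truths covered by the top-n proposals, averaged over tIoU thresholds."""
    top = proposals[:n]
    recalls = [sum(any(tiou(p, g) >= t for p in top) for g in gts) / len(gts)
               for t in thresholds]
    return float(np.mean(recalls))

# ActivityNet-1.3 convention: tIoU thresholds [0.5:0.05:0.95], AN from 1 to 100.
thresholds = np.arange(0.5, 1.0, 0.05)
proposals = [(10.0, 20.0), (9.0, 22.0), (50.0, 60.0)]   # ranked by confidence
gts = [(11.0, 19.5), (49.0, 61.0)]

ar_at_an = [recall_at(proposals, gts, n, thresholds) for n in range(1, 101)]
auc = np.trapz(ar_at_an, x=range(1, 101)) / 100.0       # normalized area under AR-AN curve
print(f"AR@100 = {ar_at_an[-1]:.3f}, AUC = {100 * auc:.2f}%")
```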
Conclusion
  • The authors propose BSN++ for temporal action proposal generation.
  • The authors are the first to consider the imbalanced data distribution of proposal durations.
  • Both the boundary map and the confidence map can be generated simultaneously in a unified network (a rough sketch of such a two-branch head is given after this list).
  • Extensive experiments conducted on the ActivityNet-1.3 and THUMOS14 datasets demonstrate the effectiveness of the method for both temporal action proposal generation and detection
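As a rough, heavily simplified sketch (our own illustration under assumed shapes, not the released BSN++ model), the snippet below shows a two-branch head that emits per-frame boundary probabilities from 1-D features and a 2-D proposal confidence map, with a non-local-style self-attention block standing in for the paper's proposal relation block. The naive tiling used to build the duration-by-start grid is a placeholder for the paper's proposal feature sampling, and all layer sizes are hypothetical.

```python
# A rough, heavily simplified two-branch head (assumed shapes, not the released BSN++
# model): one branch predicts per-frame boundary probabilities, the other applies a
# non-local-style self-attention block over a duration-by-start grid of proposal
# features before predicting a confidence map.
import torch
import torch.nn as nn

class ProposalRelationBlock(nn.Module):
    """Self-attention over the D x T grid of candidate proposals (position-aware only)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                    # x: (B, C, D, T)
        b, c, d, t = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)          # (B, D*T, C/2)
        k = self.key(x).flatten(2)                            # (B, C/2, D*T)
        v = self.value(x).flatten(2).transpose(1, 2)          # (B, D*T, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # proposal-proposal relations
        out = (attn @ v).transpose(1, 2).reshape(b, c, d, t)
        return x + out                                         # residual connection

class UnifiedHead(nn.Module):
    def __init__(self, feat_dim=256, max_duration=32):
        super().__init__()
        self.max_duration = max_duration
        self.boundary_branch = nn.Sequential(                  # boundary map branch
            nn.Conv1d(feat_dim, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(128, 2, 1), nn.Sigmoid())
        self.lift = nn.Conv2d(feat_dim, feat_dim, 1)
        self.relation = ProposalRelationBlock(feat_dim)
        self.confidence_branch = nn.Sequential(nn.Conv2d(feat_dim, 1, 1), nn.Sigmoid())

    def forward(self, feats):                                  # feats: (B, C, T)
        boundary_map = self.boundary_branch(feats)             # (B, 2, T) start/end probabilities
        # Naive tiling into a (duration, start) grid; a placeholder for the paper's
        # proposal feature sampling.
        grid = feats.unsqueeze(2).expand(-1, -1, self.max_duration, -1).contiguous()
        confidence_map = self.confidence_branch(self.relation(self.lift(grid)))
        return boundary_map, confidence_map.squeeze(1)         # (B, 2, T) and (B, D, T)

head = UnifiedHead()
boundary_map, confidence_map = head(torch.randn(2, 256, 100))
```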
Tables
  • Table1: Performance comparisons with other state-of-the-art proposal generation methods on the validation set of ActivityNet-1.3 in terms of AUC and AR@AN
  • Table2: Comparisons with other state-of-the-art proposal generation methods SCNN-prop (Shou, Wang, and Chang 2016), SST (Buch et al. 2017b), TURN (Gao et al. 2017), MGG (Liu et al. 2019), BSN (Lin et al. 2018), and BMN (Lin et al. 2019) on THUMOS14 in terms of AR@AN, where SNMS stands for Soft-NMS
  • Table3: Ablation experiments on the validation set of ActivityNet-1.3. The complementary boundary generator is abbreviated as CBG and BBM denotes bi-directional matching; PRB is the proposal relation block and SBS is the scale-balanced sampling; PAM and CAM indicate the two self-attention modules. Inference speed is the time cost in seconds (s) for processing a 3-minute video on an Nvidia 1080-Ti card; e2e denotes the joint training manner
  • Table4: Generalizability evaluation on ActivityNet-1.3
  • Table5: Detection results compared with (Shou et al. 2017; Zhao et al. 2017; Lin, Zhao, and Shou 2017a; Lin et al. 2018, 2019) on validation set of ActivityNet-1.3, where our proposals are combined with video-level classification results generated by (Xiong et al. 2016)
  • Table6: Detection results compared with (Gao et al. 2017; Lin et al. 2018; Liu et al. 2019; Lin et al. 2019) on testing set of THUMOS14, where video-level classifier (Wang et al. 2017) is combined with proposals generated by BSN++
Funding
  • Further combined with existing action classifiers, our method achieves state-of-the-art temporal action detection performance
  • For a direct comparison to BSN, our BSN++ improves AUC from 66.17% to 68.26% on validation set
  • When the AN is 100, our method significantly improves AR from 74.16% to 76.52% by 2.36%
Reference
  • Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Improving Object Detection With One Line of Code. arXiv preprint arXiv:1704.04503.
  • Buch, S.; Escorcia, V.; Ghanem, B.; Fei-Fei, L.; and Niebles, J. C. 2017a. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In Proceedings of the British Machine Vision Conference.
  • Buch, S.; Escorcia, V.; Shen, C.; Ghanem, B.; and Niebles, J. C. 2017b. SST: Single-stream temporal action proposals. In CVPR, 6373–6382. IEEE.
  • Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; and Carlos Niebles, J. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 961–970.
  • Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Duerig, T.; and Ferrari, V. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.
  • Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
  • Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. In CVPR.
  • Estabrooks, A.; Jo, T.; and Japkowicz, N. 2004. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence.
  • Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR, 1933–1941.
  • Gao, J.; Chen, K.; and Nevatia, R. 2018. CTAP: Complementary Temporal Action Proposal Generation. arXiv preprint arXiv:1807.04821.
  • Gao, J.; Yang, Z.; and Nevatia, R. 2017. Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180.
  • Gao, J.; Yang, Z.; Sun, C.; Chen, K.; and Nevatia, R. 2017. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In ICCV, 3648–3656. IEEE.
  • Girshick, R. 2015. Fast R-CNN. In ICCV, 1440–1448.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Jiang, Y.; Liu, J.; Zamir, A. R.; Toderici, G.; Laptev, I.; Shah, M.; and Sukthankar, R. 2014. THUMOS challenge: Action recognition with a large number of classes. In ECCV Workshops.
  • Lin, T.; Liu, X.; Li, X.; Ding, E.; and Wen, S. 2019. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. arXiv preprint arXiv:1907.09702.
  • Lin, T.; Zhao, X.; and Shou, Z. 2017a. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, 988–996. ACM.
  • Lin, T.; Zhao, X.; and Shou, Z. 2017b. Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750.
  • Lin, T.; Zhao, X.; Su, H.; Wang, C.; and Yang, M. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. arXiv preprint arXiv:1806.02964.
  • Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In ICCV.
  • Liu, Y.; Ma, L.; Zhang, Y.; Liu, W.; and Chang, S.-F. 2019. Multi-granularity Generator for Temporal Action Proposal. In CVPR, 3604–3613.
  • McCarthy, K.; Zabar, B.; and Weiss, G. 2005. Does cost-sensitive learning beat sampling for classifying rare classes? In Proceedings of the 1st International Workshop on Utility-Based Data Mining.
  • Qiu, Z.; Yao, T.; and Mei, T. 2017. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In ICCV, 5534–5542.
  • Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.
  • Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; and Chang, S.-F. 2017. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 1417–1426. IEEE.
  • Shou, Z.; Wang, D.; and Chang, S.-F. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 1049–1058.
  • Shrivastava, A.; Gupta, A.; and Girshick, R. 2016. Training region-based object detectors with online hard example mining. In CVPR.
  • Simonyan, K.; and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568–576.
  • Singh, G.; and Cuzzolin, F. 2016. Untrimmed video classification for activity detection: submission to ActivityNet challenge. arXiv preprint arXiv:1607.01979.
  • Su, H.; Zhao, X.; and Lin, T. 2018. Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization. In ACCV.
  • Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 4489–4497.
  • Wang, H.; Kläser, A.; Schmid, C.; and Liu, C.-L. 2011. Action recognition by dense trajectories. In CVPR, 3169–3176. IEEE.
  • Wang, H.; and Schmid, C. 2013. Action recognition with improved trajectories. In ICCV, 3551–3558.
  • Wang, L.; Xiong, Y.; Lin, D.; and Van Gool, L. 2017. UntrimmedNets for weakly supervised action recognition and detection. In CVPR.
  • Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 20–36. Springer.
  • Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local Neural Networks. In CVPR.
  • Weiss, G. M.; McCarthy, K.; and Zabar, B. 2007. Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? In DMIN.
  • Xiong, Y.; Wang, L.; Wang, Z.; Zhang, B.; Song, H.; Li, W.; Lin, D.; Qiao, Y.; Gool, L. V.; and Tang, X. 2016. CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. CVPR ActivityNet Workshop.
  • Xiong, Y.; Zhao, Y.; Wang, L.; Lin, D.; and Tang, X. 2017. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716.
  • Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; and Lin, D. 2017. Temporal action detection with structured segment networks. In ICCV.
  • Zhou, Z.; Siddiquee, M. M. R.; Tajbakhsh, N.; and Liang, J. 2018. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In MICCAI.