BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation
Abstract:
Generating human action proposals in untrimmed videos is an important yet challenging task with wide applications. Current methods often suffer from noisy boundary locations and the inferior quality of confidence scores used for proposal retrieval. In this paper, we present BSN++, a new framework which exploits complementary boundary regressors and scale-balanced relation modeling for temporal action proposal generation.
Introduction
- The temporal action detection task, which requires categorizing real-world untrimmed videos and locating the temporal boundaries of action instances, has received much attention from researchers in recent years.
- Though BSN achieves convincing performance, it still suffers from three main drawbacks: (1) BSN only employs the local details around the boundaries to predict them, without taking advantage of the rich temporal contexts throughout the whole video sequence; (2) BSN fails to consider the proposal-proposal relations for confidence evaluation; (3) the imbalanced data distribution between positive/negative proposals and across temporal durations is neglected.
- To relieve these issues, the authors propose BSN++ for temporal action proposal generation.
- The main contributions of this work are three-fold:
Highlights
- The temporal action detection task, which requires categorizing real-world untrimmed videos and locating the temporal boundaries of action instances, has received much attention from researchers in recent years
- The other type of methods (Lin et al 2018; Xiong et al 2017; Lin et al 2019), which has attracted many researchers recently, tackles this problem in a bottom-up fashion, where the input video is evaluated at a finer level. (Lin et al 2018) is a typical method of this type, which proposes the Boundary-Sensitive Network (BSN) to generate proposals with flexible durations and reliable confidence scores
- We propose the complementary boundary regressor, where the starting classifier can be used to predict the ending locations when the input videos are processed in a reversed direction, and vice versa
- We propose BSN++, which differs from previous works in three main aspects: (1) we revisit the boundary prediction task and propose a complementary boundary generator that exploits rich contexts together with a bi-directional matching strategy for boundary prediction; (2) we propose a proposal relation block for modeling proposal-proposal relations; (3) a two-stage re-sampling scheme is designed for balanced training
- We propose BSN++ for temporal action proposal generation
- Further combined with existing action classifiers, our method achieves state-of-the-art temporal action detection performance
- The complementary boundary generator takes advantage of a U-shaped architecture and a bi-directional boundary matching mechanism to learn rich contexts for boundary prediction; a minimal sketch of the bi-directional fusion idea follows this list
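To make the bi-directional matching idea above concrete, here is a minimal sketch (not the authors' code) of how a forward pass and a reversed pass could be fused, assuming the boundary classifier outputs per-snippet starting/ending probabilities; the function name `fuse_boundary_probs` and the simple 0.5/0.5 averaging rule are illustrative assumptions.

```python
import numpy as np

def fuse_boundary_probs(start_fwd, end_fwd, start_bwd, end_bwd):
    """Fuse boundary probabilities from a forward and a reversed pass (illustrative sketch).

    When the video is fed in reverse, the starting classifier fires on
    (reversed) ending locations and vice versa, so the reversed-pass outputs
    are flipped back to the original time axis and averaged with the forward
    ones. All inputs are 1-D arrays of per-snippet probabilities of equal length.
    """
    # A "start" on the reversed time axis corresponds to an "end" on the original axis.
    end_from_bwd = start_bwd[::-1]
    start_from_bwd = end_bwd[::-1]

    start = 0.5 * (start_fwd + start_from_bwd)   # complementary starting scores
    end = 0.5 * (end_fwd + end_from_bwd)         # complementary ending scores
    return start, end

# Toy usage: 8 snippets with a likely action roughly between snippets 2 and 5.
start_fwd = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.1, 0.1])
end_fwd   = np.array([0.1, 0.1, 0.1, 0.2, 0.3, 0.8, 0.2, 0.1])
# Reversed-pass outputs, indexed on the reversed time axis.
start_bwd = np.array([0.1, 0.1, 0.85, 0.2, 0.1, 0.1, 0.1, 0.1])
end_bwd   = np.array([0.1, 0.1, 0.1, 0.1, 0.2, 0.9, 0.2, 0.1])

start, end = fuse_boundary_probs(start_fwd, end_fwd, start_bwd, end_bwd)
print(start.round(2), end.round(2))
```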
Methods
- [Flattened table fragments: proposal comparisons (AR@1, AR@100, AUC) against SSAD-prop, CTAP, BSN, MGG and BMN on ActivityNet-1.3; detection mAP@tIoU on the ActivityNet-1.3 validation set (vs. SSN, SSAD, BSN, BMN) and on THUMOS14 testing (vs. TURN, BSN, MGG, BMN with the UNet classifier). See Tables 1, 2, 5 and 6 for the full results.]
- To further examine the quality of proposals generated by BSN++, following BSN (Lin et al 2018), the authors feed them to the state-of-the-art action classifiers to obtain the categories for action detection in a “detection by classification” framework.
- On ActivityNet-1.3, the authors use the top-1 video-level classification results generated by (Xiong et al 2016) for all the generated proposals.
- On THUMOS14, the authors use the top-2 video-level classification results generated by UntrimmedNet (Wang et al 2017).
- The authors observe that, with the same classifiers, the detection performance is boosted greatly, which further demonstrates the effectiveness and superiority of the method; a hedged sketch of this detection-by-classification step is given after this list
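As an illustration of the "detection by classification" step described above, the sketch below (hypothetical helper name `detections_from_proposals`, not the authors' implementation) assigns top-k video-level labels to class-agnostic proposals and scores each detection by the product of proposal confidence and class score; the product scoring rule is an assumed convention.

```python
def detections_from_proposals(proposals, video_labels, top_k=2):
    """'Detection by classification' sketch: attach video-level labels to proposals.

    proposals    : list of (t_start, t_end, confidence) from the proposal generator
    video_labels : list of (class_name, class_score) sorted by score, video level
    Returns (t_start, t_end, class_name, detection_score) tuples, where the
    detection score is the product of proposal confidence and class score
    (an assumed convention for this sketch).
    """
    detections = []
    for t_start, t_end, conf in proposals:
        # top-1 labels on ActivityNet-1.3, top-2 labels on THUMOS14
        for cls, cls_score in video_labels[:top_k]:
            detections.append((t_start, t_end, cls, conf * cls_score))
    return sorted(detections, key=lambda d: d[3], reverse=True)

# Toy usage
props = [(12.0, 34.5, 0.91), (40.2, 55.0, 0.45)]
labels = [("LongJump", 0.88), ("HighJump", 0.07)]
print(detections_from_proposals(props, labels, top_k=2)[:3])
```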
Results
- Evaluation metrics
Following the conventions, Average Recall (AR) is calculated under different tIoU thresholds, which are set to [0.5:0.05:0.95] on ActivityNet-1.3 and [0.5:0.05:1.0] on THUMOS14.
- The authors calculate the area under the AR vs. AN curve (AUC) as another evaluation metric on the ActivityNet-1.3 dataset, where AN ranges from 0 to 100; a minimal sketch of these metrics is given after this list.
- Comparison to the state-of-the-arts.
- It can be observed that BSN++ outperforms other state-of-the-art proposal generation methods by a large margin in terms of AR@AN and AUC on the validation set of ActivityNet-1.3.
- For a direct comparison to BSN, BSN++ improves the AUC from 66.17% to 68.26% on the validation set; when AN is 100, it improves AR from 74.16% to 76.52%, a gain of 2.36%.
- On ActivityNet-1.3, the mAP with tIoU thresholds {0.5, 0.75, 0.95} and the average mAP with tIoU thresholds [0.5:0.05:0.95] are reported.
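The sketch below illustrates how AR@AN and the AUC of the AR-vs-AN curve referenced above can be computed; the helper names (`tiou`, `average_recall_at_an`, `auc_ar_vs_an`) and the trapezoidal normalization are illustrative assumptions, not the official evaluation code. The tIoU thresholds match the conventions quoted above (THUMOS14 additionally includes 1.0).

```python
import numpy as np

def tiou(p, g):
    """Temporal IoU between proposal p and ground truth g, each given as (start, end)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = max(p[1], g[1]) - min(p[0], g[0])
    return inter / union if union > 0 else 0.0

def average_recall_at_an(proposals, gts, an, thresholds):
    """Average Recall over tIoU thresholds using the top-`an` proposals per video.

    proposals : list (per video) of (start, end, score) tuples sorted by score desc
    gts       : list (per video) of ground-truth (start, end) instances
    """
    recalls = []
    for thr in thresholds:
        hit, total = 0, 0
        for props, gt in zip(proposals, gts):
            total += len(gt)
            for g in gt:
                if any(tiou(p[:2], g) >= thr for p in props[:an]):
                    hit += 1
        recalls.append(hit / max(total, 1))
    return float(np.mean(recalls))

def auc_ar_vs_an(proposals, gts, max_an=100,
                 thresholds=np.arange(0.5, 1.0, 0.05)):
    """Area under the AR-vs-AN curve with AN in [1, max_an] (ActivityNet-1.3 convention).

    The division by max_an normalizes the area to [0, 1]; this normalization is
    an assumption of the sketch, not necessarily the official scaling.
    """
    ans = np.arange(1, max_an + 1)
    ars = [average_recall_at_an(proposals, gts, an, thresholds) for an in ans]
    return float(np.trapz(ars, ans) / max_an), ars
```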
Conclusion
- The authors propose BSN++ for temporal action proposal generation.
- The authors are the first to consider the imbalanced data distribution of proposal durations.
- Both the boundary map and confidence map can be generated simultaneously in a unified network.
- Extensive experiments conducted on the ActivityNet-1.3 and THUMOS14 datasets demonstrate the effectiveness of the method for both temporal action proposal generation and temporal action detection
Tables
- Table1: Performance comparisons with other state-of-the-art proposal generation methods on the validation set of ActivityNet-1.3 in terms of AUC and AR@AN
- Table2: Comparisons with other state-of-the-art proposal generation methods SCNN-prop (Shou, Wang, and Chang 2016), SST (Buch et al 2017b), TURN (Gao et al 2017), MGG (Liu et al 2019), BSN (Lin et al 2018), BMN (Lin et al 2019) on THUMOS14 in terms of AR@AN, where SNMS stands for Soft-NMS
- Table3: Ablation experiments on the validation set of ActivityNet-1.3. The complementary boundary generator is abbreviated as CBG and BBM denotes bi-directional matching. PRB is the proposal relation block and SBS is the scale-balanced sampling. PAM and CAM indicate the two self-attention modules. Inference speed is the time in seconds (Tcost) for processing a 3-minute video using an Nvidia 1080-Ti card. e2e denotes the joint training manner
- Table4: Generalizability evaluation on ActivityNet-1.3
- Table5: Detection results compared with (Shou et al 2017; Zhao et al 2017; Lin, Zhao, and Shou 2017a; Lin et al 2018, 2019) on the validation set of ActivityNet-1.3, where our proposals are combined with video-level classification results generated by (Xiong et al 2016)
- Table6: Detection results compared with (Gao et al 2017; Lin et al 2018; Liu et al 2019; Lin et al 2019) on the testing set of THUMOS14, where the video-level classifier (Wang et al 2017) is combined with proposals generated by BSN++
Related work
- Action Recognition
Action recognition is an essential branch which has been extensively explored in recent years. Earlier methods such as improved Dense Trajectory (iDT) (Wang et al 2011; Wang and Schmid 2013) mainly adopt the hand-crafted features including HOG, MBH and HOF. Current deep learning based methods (Feichtenhofer, Pinz, and Zisserman 2016; Simonyan and Zisserman 2014; Tran et al 2015; Wang et al 2016) typically contain two main categories: the two-stream networks (Feichtenhofer, Pinz, and Zisserman 2016; Simonyan and Zisserman 2014) capture the appearance features and motion information from RGB image and stacked optical flow respectively; 3D networks (Tran et al 2015; Qiu, Yao, and Tao 2017) exploit 3D convolutional layers to capture the spatial and temporal information directly from the raw videos. Action recognition networks are usually adopted to extract visual feature sequence from untrimmed videos for the temporal action proposals and detection task.
Imbalanced Distribution Training
Imbalanced data distribution naturally exists in many large-scale datasets (Cordt et al 2018; Cordts et al 2016). Current literature can be mainly divided into three categories: (1) re-sampling, which includes oversampling the minority classes (Estabrooks, Jo, and Japkowicz 2004) or downsampling the majority classes (Weiss, McCarthy, and Zabar 2007); (2) re-weighting, namely cost-sensitive learning (McCarthy, Zabar, and Weiss 2005; Cui et al 2019), which aims to dynamically adjust the weight of samples or different classes during the training process; (3) in the object detection task, where the imbalance between background and foreground is more serious for one-stage detectors, methods such as focal loss (Lin et al 2017) and online hard example mining (Shrivastava, Gupta, and Girshick 2016) are designed to mitigate it. In this paper, we implement the scale-balanced re-sampling upon the IoU-balanced sampling for proposal confidence evaluation, motivated by the imbalanced loss distribution against proposal durations within a mini-batch. A minimal sketch of this two-stage sampling idea is given below.
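Since this summary does not spell out the exact scale-balanced re-sampling procedure, the following is only a minimal sketch of the general two-stage idea (IoU-balanced sampling followed by duration-balanced sampling); the function name `scale_balanced_sample`, the bin counts, and the candidate dictionary fields are illustrative assumptions.

```python
import random
from collections import defaultdict

def scale_balanced_sample(candidates, n_samples, n_iou_bins=3, n_scale_bins=3):
    """Two-stage re-sampling sketch: IoU-balanced, then scale (duration)-balanced.

    candidates : list of dicts with keys 'iou' (max tIoU with ground truth, in
                 [0, 1]) and 'duration' (normalized proposal length, in [0, 1]).
    Stage 1 spreads samples across IoU bins so that easy negatives do not
    dominate; stage 2 spreads them across duration bins so that one temporal
    scale does not dominate the mini-batch loss.
    """
    # Stage 1: bucket by IoU with the ground truth and draw evenly per bucket.
    iou_bins = defaultdict(list)
    for c in candidates:
        iou_bins[min(int(c['iou'] * n_iou_bins), n_iou_bins - 1)].append(c)
    per_iou = max(1, n_samples // max(len(iou_bins), 1))
    stage1 = []
    for bucket in iou_bins.values():
        stage1 += random.sample(bucket, min(per_iou, len(bucket)))

    # Stage 2: bucket the stage-1 pool by duration and draw evenly again.
    scale_bins = defaultdict(list)
    for c in stage1:
        scale_bins[min(int(c['duration'] * n_scale_bins), n_scale_bins - 1)].append(c)
    per_scale = max(1, n_samples // max(len(scale_bins), 1))
    sampled = []
    for bucket in scale_bins.values():
        sampled += random.sample(bucket, min(per_scale, len(bucket)))
    return sampled[:n_samples]
```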
References
- Bodla, N.; Singh, B.; Chellappa, R.; and Davis, L. S. 2017. Improving Object Detection With One Line of Code. arXiv preprint arXiv:1704.04503.
- Buch, S.; Escorcia, V.; Ghanem, B.; Fei-Fei, L.; and Niebles, J. C. 2017a. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In Proceedings of the British Machine Vision Conference.
- Buch, S.; Escorcia, V.; Shen, C.; Ghanem, B.; and Niebles, J. C. 2017b. Sst: Single-stream temporal action proposals. In CVPR, 6373–6382. IEEE.
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; and Carlos Niebles, J. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 961–970.
- Cordt, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2018. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. CVPR.
- Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. CVPR.
- Estabrooks, A.; Jo, T.; and Japkowicz, N. 2004. A multiple resampling method for learning from imbalanced data sets. Computational intelligence.
- Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR, 1933–1941.
- Gao, J.; Kan, C.; and Nevatia, R. 2018. CTAP: Complementary Temporal Action Proposal Generation. arXiv preprint arXiv:1807.04821.
- Gao, J.; Yang, Z.; and Nevatia, R. 2017. Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180.
- Gao, J.; Yang, Z.; Sun, C.; Chen, K.; and Nevatia, R. 2017. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In ICCV, 3648–3656. IEEE.
- Girshick, R. 2015. Fast r-cnn. In ICCV, 1440–1448.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Jiang, Y.; Liu, J.; Zamir, A. R.; Toderici, G.; Laptev, I.; Shah, M.; and Sukthankar, R. 2014. THUMOS challenge: Action recognition with a large number of classes. In Computer Vision-ECCV workshop 2014.
- Lin, T.; Liu, X.; Li, X.; Ding, E.; and Wen, S. 2019. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. CoRR abs/1907.09702.
- Lin, T.; Zhao, X.; and Shou, Z. 2017a. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, 988–996. ACM.
- Lin, T.; Zhao, X.; and Shou, Z. 2017b. Temporal Convolution Based Action Proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750.
- Lin, T.; Zhao, X.; Su, H.; Wang, C.; and Yang, M. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. arXiv preprint arXiv:1806.02964.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollar, P. 2017. Focal loss for dense object detection. ICCV.
- Liu, Y.; Ma, L.; Zhang, Y.; Liu, W.; and Chang, S.-F. 2019. Multi-granularity Generator for Temporal Action Proposal. In CVPR, 3604–3613.
- McCarthy, K.; Zabar, B.; and Weiss, G. 2005. Does cost-sensitive learning beat sampling for classifying rare classes? Proceedings of the 1st international workshop on Utility-based data mining.
- Qiu, Z.; Yao, T.; and Tao, M. 2017. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In ICCV, 5534–5542.
- Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 234–241. Springer.
- Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; and Chang, S.-F. 2017. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR. IEEE.
- Shou, Z.; Wang, D.; and Chang, S.-F. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, 1049–1058.
- Shrivastava, A.; Gupta, A.; and Girshick, R. 2016. Training region-based object detectors with online hard ex- ample mining. CVPR.
- Simonyan, K.; and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, 568–576.
- Singh, G.; and Cuzzolin, F. 2016. Untrimmed video classification for activity detection: submission to activitynet challenge. arXiv preprint arXiv:1607.01979.
- Su, H.; Zhao, X.; and Lin, T. 2018. Cascaded Pyramid Mining Network for Weakly Supervised Temporal Action Localization. In ACCV.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 4489–4497.
- Wang, H.; Klaser, A.; Schmid, C.; and Liu, C.-L. 2011. Action recognition by dense trajectories. In CVPR, 3169–3176. IEEE.
- Wang, H.; and Schmid, C. 2013. Action recognition with improved trajectories. In ICCV, 3551–3558.
- Wang, L.; Xiong, Y.; Lin, D.; and Van Gool, L. 2017. Untrimmednets for weakly supervised action recognition and detection. In CVPR, volume 2.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 20–36. Springer.
- Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Nonlocal Neural Networks. CVPR.
- Weiss, G. M.; McCarthy, K.; and Zabar, B. 2007. Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? DMIN.
- Xiong, Y.; Wang, L.; Wang, Z.; Zhang, B.; Song, H.; Li, W.; Lin, D.; Qiao, Y.; Gool, L. V.; and Tang, X. 2016. CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. CVPR ActivityNet Workshop.
- Xiong, Y.; Zhao, Y.; Wang, L.; Lin, D.; and Tang, X. 2017. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716.
- Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; and Lin, D. 2017. Temporal action detection with structured segment networks. In ICCV, volume 2.
- Zhou, Z.; Siddiquee, M. M. R.; Tajbakhsh, N.; and Liang, J. 2018. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. MICCAI.