Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation

Bin Yan
Xinyu Zhang
Xiaoyun Yang

Abstract:

Visual object tracking aims to precisely estimate the bounding box for the given target, which is a challenging problem due to factors such as deformation and occlusion. Many recent trackers adopt the multiple-stage tracking strategy to improve the quality of bounding box estimation. These methods first coarsely locate the target and then refine the initial result in additional tracking stages to obtain a more precise box prediction.

Introduction
  • Precise box estimation is indispensable for a successful tracker. Early trackers usually solve this problem by multi-scale search [1, 4, 35, 3] or a sampling-then-regression strategy [37, 28], which are inaccurate and greatly limit the performance of trackers.
  • To obtain more precise tracking results, many state-of-the-art trackers [40, 8, 5, 2] adopt a multiple-stage tracking strategy, which introduces additional tracking stages for more precise box estimation.
  • These trackers first coarsely locate the target and refine the initial result in the additional tracking stages to get more precise box prediction.
  • Given the perfectly centered search region, the proposed Alpha-Refine achieves significantly better performance, demonstrating Alpha-Refine’s superiority in box estimation.
Highlights
  • Precise box estimation is indispensable for a successful tracker
  • We find that extracting and maintaining precise spatial information is the key to precise box estimation
  • We propose a novel, flexible, and accurate refinement module named Alpha-Refine, which can efficiently refine the base tracker’s outputs and significantly improve the tracking performance
  • We propose a novel Alpha-Refine method for visual tracking, which is an accurate and general refinement module to effectively improve the tracking performance of different types of trackers in a plug-and-play style (see the usage sketch after this list)
  • By exploring multiple design options, we find that extracting and maintaining precise spatial information is the key to precise box estimation
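The highlights describe Alpha-Refine as a plug-and-play refinement step applied to a base tracker's output. The following minimal Python sketch illustrates that per-frame workflow; the `base_tracker` and `refiner` interfaces are hypothetical placeholders, not the authors' released API.

```python
# Hedged sketch of plug-and-play refinement: the base tracker produces a
# coarse box each frame, and a refinement module (e.g., Alpha-Refine)
# re-estimates a precise box from the region around it.
# `base_tracker` and `refiner` are hypothetical duck-typed objects.

def track_with_refinement(frames, init_box, base_tracker, refiner):
    """frames: list of images; init_box: (x, y, w, h) in the first frame."""
    base_tracker.init(frames[0], init_box)   # stage 1: coarse localization
    refiner.init(frames[0], init_box)        # refinement module keeps the template

    boxes = [init_box]
    for frame in frames[1:]:
        coarse_box = base_tracker.track(frame)            # coarse (x, y, w, h)
        refined_box = refiner.refine(frame, coarse_box)   # precise box estimation
        base_tracker.update_state(refined_box)            # optionally feed back the better box
        boxes.append(refined_box)
    return boxes
```

Because the refiner only needs the current frame and a coarse box, it can be attached to trackers with very different designs, which is what the plug-and-play claim refers to.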
Methods
  • On the LaSOT test set (AUC / PNorm / P), DiMPsuper improves from 63.7 / 72.5 / 65.6 (Base) to 65.3 / 73.2 / 68.0 (Base+AR); gains of up to 19% are reported. The previous best tracker is Siam R-CNN [39], which obtains a 64.8% AUC score but runs at merely around 5 fps.
  • Note that the best tracker, DiMPsuper+AR, obtains real-time performance.
  • SiamRPNpp + AR with different backbones (accuracy / speed / total latency / AR latency): ResNet-50: 56.2 / 46.5 fps / 21.5 ms / 6.6 ms; ResNet-34: 55.9 / 50.0 fps / 20.0 ms / 5.1 ms; ResNet-18: 55.0 / 52.4 fps / 19.1 ms / 4.2 ms.
Results
  • The authors can see that the Alpha-Refine module helps the tracker obtain more precise bounding boxes than IoU-Net and SiamMask.
  • More visual results are presented in the supplementary material
Conclusion
  • The authors propose a novel Alpha-Refine method for visual tracking, which is an accurate and general refinement module to effectively improve the tracking performance of different types of trackers in a plug-and-play style.
  • Alpha-Refine adopts a precise pixel-wise correlation layer, a key-point style prediction head, and an auxiliary mask head (a minimal sketch of the first two follows this list).
  • The authors apply the Alpha-Refine model to six well-known and top-performing trackers and conduct extensive evaluations on four popular benchmarks.
  • The results demonstrate that Alpha-Refine consistently improves the tracking performance with little extra computational load.
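As an illustration of the first two components named above (the pixel-wise correlation layer and the key-point style prediction head), here is a minimal PyTorch-style sketch. Tensor shapes, function names, and the soft-argmax corner decoding are assumptions made for exposition, not the authors' implementation.

```python
import torch

def pixelwise_correlation(ref_feat, test_feat):
    """Pixel-wise correlation: each spatial position of the reference
    (template) feature map acts as a 1x1 kernel over the test feature map,
    so fine spatial detail of the template is preserved in the output.

    ref_feat:  (B, C, H0, W0) reference features
    test_feat: (B, C, H, W)   search-region features
    returns:   (B, H0*W0, H, W) correlation volume
    """
    B, C, H0, W0 = ref_feat.shape
    kernels = ref_feat.reshape(B, C, H0 * W0)            # (B, C, K)
    feats = test_feat.flatten(2)                         # (B, C, H*W)
    corr = torch.einsum('bck,bcn->bkn', kernels, feats)  # (B, K, H*W)
    return corr.reshape(B, H0 * W0, *test_feat.shape[-2:])

def keypoint_box_head(tl_heatmap, br_heatmap):
    """Key-point style box prediction: a soft-argmax over two corner
    heatmaps yields sub-pixel top-left / bottom-right coordinates.

    tl_heatmap, br_heatmap: (B, 1, H, W) corner score maps
    returns: (B, 4) boxes (x1, y1, x2, y2) in feature-map coordinates
    """
    def soft_argmax(hm):
        B, _, H, W = hm.shape
        prob = hm.flatten(1).softmax(dim=1).reshape(B, H, W)
        xs = torch.arange(W, dtype=prob.dtype, device=prob.device).view(1, 1, W)
        ys = torch.arange(H, dtype=prob.dtype, device=prob.device).view(1, H, 1)
        return (prob * xs).sum(dim=(1, 2)), (prob * ys).sum(dim=(1, 2))

    x1, y1 = soft_argmax(tl_heatmap)
    x2, y2 = soft_argmax(br_heatmap)
    return torch.stack([x1, y1, x2, y2], dim=1)
```

In the actual module the correlation volume would pass through further convolutions before the corner and auxiliary mask heads; the sketch only isolates the two operations named in the conclusion.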
Tables
  • Table1: Oracle experiment on LaSOT. The center of the search region is always set at the center of the ground truth, reflecting the estimation capacity of these methods. The best three results are marked in red, green and blue bold fonts respectively
  • Table2: Comparison results on the LaSOT test set. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine
  • Table3: Latency and speed of different methods. The tracking speed is measured using frame per second (fps)
  • Table4: Analysis of different head options. The best three results are marked in red, green and blue bold fonts, respectively. Numbers are shown in percentage (%)
  • Table5: Analysis of different feature fusion types. Naive indicates the typical feature correlation between reference and test branches. Numbers are shown in percentage (%)
  • Table6: Comparison of different refinement modules. The best result is marked in red bold fonts
  • Table7: Accuracy and Speed Comparison of SiamRPNpp+AR with different backbones
  • Table8: Comparison results on the TrackingNet test set. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine
  • Table9: Comparison results on the GOT-10K test set. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine
  • Table10: Comparison results on the VOT2020 benchmark. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine
  • Table11: Comparison with other refinement modules on the LaSOT test set. “Base tracker-AR” denotes the base tracker with Alpha-Refine. ARm0.3 indicates that the output of the mask head with a threshold of 0.3 is used as the refinement result. The best three results are marked in red, green and blue bold fonts, respectively. † and ‡ indicate Alpha-Refine modules trained with lite versions of the training set that intersect with IoU-Net’s and SiamMask’s training sets, respectively. Best viewed in color with zoom-in
  • Table12: Comparison on the VOT2018 benchmark. The best three results are marked in red, green and blue bold fonts respectively
  • Table13: Comparison results on the OTB2015 benchmark. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with AlphaRefine. The best three results are marked in red, green and blue bold fonts, respectively. Numbers are shown in percentage (%)
  • Table14: Comparison results on the NfS benchmark. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine. The best three results are marked in red, green and blue bold fonts, respectively. Numbers are shown in percentage (%)
  • Table15: Comparison results on the Temple Color-128 benchmark. ‘Base’: the base tracker; and ‘Base+AR’: the base tracker with Alpha-Refine. The best three results are marked in red, green and blue bold fonts, respectively. Numbers are shown in percentage (%)
Related work
  • Early Box Estimation. Early box estimation methods mainly perform scale estimation and can be summarized into two categories: multiple-scale search and sampling-then-regression strategies. Most correlation-filter-based trackers [10, 4, 35] and SiamFC [1] adopt the former strategy. Specifically, these trackers construct search regions of different sizes, compute the correlation with the template, and take the scale level with the highest response as the target size. Multiple-scale search is coarse and time-consuming due to its fixed-aspect-ratio prediction and heavy image-pyramid operation. The other type of method first generates several bounding box samples, then selects the best one, and finally applies regression to it to obtain a more accurate result. SINT [37], MDNet [28] and RT-MDNet [13] are three representative trackers that exploit this approach.
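To make the multiple-scale search strategy concrete, the sketch below scores a small scale pyramid and keeps the size of the best-responding level. The `correlation_response` callable, the crop preparation, and the particular scale factors are placeholders rather than any specific tracker's settings.

```python
import numpy as np

def multiscale_search(search_crops, template, correlation_response,
                      prev_size, scale_factors=(0.96, 1.00, 1.04)):
    """Sketch of multiple-scale search with a fixed aspect ratio.

    search_crops: one search-region crop per scale factor, all resized
                  to the same input resolution (the image pyramid)
    template:     target template (image patch or feature map)
    correlation_response: callable(crop, template) -> 2D response map
    prev_size:    (w, h) of the target in the previous frame
    """
    best_score, best_scale = -np.inf, 1.0
    for crop, scale in zip(search_crops, scale_factors):
        score = correlation_response(crop, template).max()
        if score > best_score:
            best_score, best_scale = score, scale
    # Only the overall size is rescaled; the aspect ratio stays fixed,
    # which is one reason this strategy is coarse.
    w, h = prev_size
    return best_scale * w, best_scale * h
```

Because every scale level needs its own crop and its own matching pass, the pyramid multiplies the per-frame cost, which is the heavy image-pyramid operation mentioned above.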
Funding
  • Comprehensive experiments on TrackingNet, LaSOT, GOT-10K, and VOT2020 benchmarks show that our approach significantly improves the base tracker's performance with little extra latency
  • When the ResNet-18 backbone is used, the latency of our AR model is very low, while the corresponding performance is still 7.4% higher than that of the original SiamRPNpp
Study subjects and analysis
λ = 1000 is used in the experiments. We train Alpha-Refine for 40 epochs, each of which consists of 500 iterations, on eight Nvidia 2080Ti GPUs with a batch size of 32 per GPU (32 × 8 samples per iteration in total). Considering the abundance of the training data, we do not freeze any parameters of the backbone.
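For concreteness, the following skeleton mirrors the schedule described above (40 epochs of 500 iterations, batch size 32 per GPU, no frozen backbone parameters). The optimizer choice, the loss composition, and the role of λ as a loss weight are assumptions made for this sketch, not a transcription of the released training code.

```python
import torch
import torch.nn.functional as F

def train_sketch(model, sample_batch, num_epochs=40, iters_per_epoch=500,
                 lam=1000.0, lr=1e-3):
    """Skeletal training loop for a refinement module like Alpha-Refine.

    `model` is assumed to map (reference, test) crops to (box_pred, mask_pred);
    `sample_batch` is assumed to yield a dict of tensors. Both are hypothetical.
    """
    # No parameters are frozen, so the whole model is handed to the optimizer.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for _ in range(iters_per_epoch):
            batch = sample_batch()  # 32 reference/test pairs per GPU
            box_pred, mask_pred = model(batch['reference'], batch['test'])
            # Assumed loss composition: a box-regression term plus a
            # lambda-weighted auxiliary mask term, with lambda = 1000 as above.
            loss = (F.l1_loss(box_pred, batch['gt_box'])
                    + lam * F.binary_cross_entropy_with_logits(mask_pred, batch['gt_mask']))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```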

Reference
  • Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCVW, 2016. 1, 2, 3
  • Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In ICCV, 2019. 1, 2, 6, 7, 11, 12
  • Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. Unveiling the power of deep tracking. In ECCV, 2018. 1
  • Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In CVPR, 2017. 1, 2, 6
  • Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ATOM: Accurate tracking by overlap maximization. In CVPR, 2019. 1, 2, 6, 7, 11
  • Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In ICCV, 2019. 4
  • Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR, 2019. 2, 5, 6, 7, 11
  • Heng Fan and Haibin Ling. Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, 2019. 1, 2
  • Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1134–1143. IEEE Computer Society, 2017. 12
  • Joao F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015. 2
  • Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. TPAMI, 2019. 2, 5, 6, 8, 11
  • Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018. 2
  • Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. Real-time MDNet. In ECCV, 2018. 2, 6
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
  • Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. Foveabox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019. 2
  • Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In ECCVW, 2018. 5
  • Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Cehovin Zajc, Martin Danelljan, Alan Lukezic, Ondrej Drbohlav, Linbo He, Yushan Zhang, Song Yan, Jinyu Yang, Gustavo Fernandez, and et al. The eighth visual object tracking vot2020 challenge results. In ECCVW, 2020. 2, 6, 8
  • Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Cehovin Zajc, Ondrej Drbohlav, Alan Lukezic, Amanda Berg, et al. The seventh visual object tracking vot2019 challenge results. In ICCVW, 2019. 2, 5
  • Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, 2018. 2, 4
  • Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019. 1, 2, 3, 6, 7
  • Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018. 2, 3
  • Pengpeng Liang, Erik Blasch, and Haibin Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process., 24(12):5630– 5644, 2015. 13
  • Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 5, 11
  • Alan Lukezic, Jiri Matas, and Matej Kristan. D3S - A discriminative single shot segmentation tracker. In CVPR, 2020. 2, 5
  • Diogo C Luvizon, Hedi Tabia, and David Picard. Human pose regression by combining indirect part detection and contextual information. Computers & Graphics, 85:15–22, 2019. 4
  • Ziang Ma, Linyuan Wang, Haitao Zhang, Wei Lu, and Jun Yin. RPT: learning point set representation for siamese visual tracking. CoRR, abs/2008.03467, 2020. 12
  • Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018. 2, 6, 7, 11
  • Hyeonseob Nam and Bohyung Han. Learning multi–domain convolutional neural networks for visual tracking. In CVPR, 2016. 1, 2
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 6
  • Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017. 11
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, 2015. 2, 4
  • Olaf Ronneberger, Philipp Fischer, and Thomas Brox. UNet: convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells III, and Alejandro F. Frangi, editors, MICCAI, 2015. 5
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. ImageNet Large scale visual recognition challenge. IJCV, 2015. 5, 11
  • Jianping Shi, Qiong Yan, Li Xu, and Jiaya Jia. Hierarchical image saliency detection on extended cssd. TPAMI, 38(4), 2015. 5
  • Chong Sun, Dong Wang, Huchuan Lu, and Ming-Hsuan Yang. Correlation tracking via joint discrimination and reliability learning. In CVPR, 2018. 1, 2
  • Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018. 2
  • Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking. In CVPR, 2016. 1, 2
  • Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019. 2
  • Paul Voigtlaender, Jonathon Luiten, Philip H. S. Torr, and Bastian Leibe. Siam R-CNN: visual tracking by re-detection. In CVPR, 2020. 6, 8, 12, 13
  • Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun Zeng. SPM-tracker: Series-parallel matching for real-time visual object tracking. In CVPR, 2019. 1, 2
  • Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, 2017. 5
  • Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H. S. Torr. Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019. 2, 3, 5, 6, 7, 11
  • Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. Ranet: Ranking attention network for fast video object segmentation. In ICCV, 2019. 3
  • Yi Wu, Jongwoo Lim, and Ming Hsuan Yang. Object tracking benchmark. TPAMI, 37(9):1834–1848, 2015. 12
  • Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, pages 585–601, 2018. 5, 11
  • Yinda Xu, Zeyu Wang, Zuoxin Li, Yuan Ye, and Gang Yu. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188, 2019. 2
  • Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, 2013. 5
  • Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 2, 4
  • Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019. 4
  • Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018. 2, 3