
Online Decision Based Visual Tracking via Reinforcement Learning

NeurIPS 2020


Abstract

A deep visual tracker is typically based on either object detection or template matching, while each of them is only suitable for a particular group of scenes. It is straightforward to consider fusing them together to pursue more reliable tracking. However, this is not wise as they follow different tracking principles. Unlike previous fusion...

Introduction
  • As a fundamental task in computer vision, visual tracking aims to estimate the trajectory of a specified object in a sequence of images.
  • Inspired by the success of deep learning in general computer vision tasks, recent visual tracking algorithms mostly use deep networks, typically CNNs, which extract deep representations for various scenes.
  • Among these deep trackers, there are two dominant tracking schemes: detection-based tracking and template matching.
  • The detection tracker is continuously updated with candidate patches; the diverse appearances of these patches give it good adaptability, while the continuous update is inefficient for real-world tracking.
  • The template tracker utilizes the initial appearance of the target as a fixed template to conduct the matching operation, which runs efficiently at the cost of adaptability (a matching sketch follows this list)
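
To make the contrast between the two schemes concrete, here is a minimal sketch of the template-matching side in PyTorch. It is illustrative only, not the authors' implementation: the feature extractor is omitted and the feature shapes are assumptions; it merely shows how a fixed template feature can be correlated against a search-region feature, SiamFC-style.

```python
import torch
import torch.nn.functional as F

def template_response(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Correlate a fixed template feature with a search-region feature map.

    template_feat: (1, C, h, w)  features of the initial target appearance
    search_feat:   (1, C, H, W)  features of the current search region
    returns:       (1, 1, H-h+1, W-w+1) response map; its peak gives the target location.
    """
    # Using the template as a convolution kernel implements sliding-window matching:
    # the template is never updated, which is what makes this branch fast but less adaptive.
    return F.conv2d(search_feat, template_feat)

# Toy run with random tensors standing in for a CNN backbone's output (shapes are assumptions).
template = torch.randn(1, 256, 6, 6)
search = torch.randn(1, 256, 22, 22)
response = template_response(template, search)
best_location = response.flatten(2).argmax(-1)  # index of the best-matching position
```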
Highlights
  • As a fundamental task in computer vision, visual tracking aims to estimate the trajectory of a specified object in a sequence of images
  • We provide a dedicated scheme for jointly training the decision and tracker modules end-to-end via hierarchical reinforcement learning (HRL)
  • We propose an ensemble framework which learns an online decision for visual tracking based on HRL, where the detection and template trackers compete with each other to substantiate a switching strategy
  • We develop a novel proposal-free detection tracker, which does not require proposing candidate bounding boxes of the target and makes the discrimination process flexible (see the sketch after this list)
  • We propose an ensemble framework, namely DTNet, composed of a decision module and a tracker module for visual tracking
  • Our automated decision module significantly outperforms a handcrafted one that relies on manually tuned thresholds for tracker selection
  • Differing from fusion-based methods, DTNet learns an online decision to pick a particular tracker for a particular scene
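
The "proposal-free" aspect of the detection tracker can be illustrated with a short sketch: instead of scoring candidate bounding boxes, every feature-map location directly regresses its distances to the four box sides, FCOS-style. This is a hypothetical illustration under assumed shapes and stride, not the authors' FCT network.

```python
import torch

def decode_boxes(ltrb: torch.Tensor, stride: int = 8) -> torch.Tensor:
    """Decode dense per-location box offsets into (x1, y1, x2, y2) boxes.

    ltrb: (H, W, 4) predicted distances from each feature-map location to the
          left/top/right/bottom sides of the target box; no candidate proposals needed.
    """
    H, W, _ = ltrb.shape
    ys = torch.arange(H, dtype=torch.float32).view(H, 1).expand(H, W)
    xs = torch.arange(W, dtype=torch.float32).view(1, W).expand(H, W)
    cx = (xs + 0.5) * stride  # centre of each location mapped back to image coordinates
    cy = (ys + 0.5) * stride
    return torch.stack([cx - ltrb[..., 0], cy - ltrb[..., 1],
                        cx + ltrb[..., 2], cy + ltrb[..., 3]], dim=-1)

# The location with the highest classification score is kept as the tracked box.
scores = torch.rand(32, 32)                       # assumed per-location target scores
boxes = decode_boxes(torch.rand(32, 32, 4) * 50)  # assumed per-location offsets
best_box = boxes.view(-1, 4)[scores.flatten().argmax()]
```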
Methods
  • As shown in Fig. 2, the proposed framework consists of two modules: the decision module and the tracker module.
  • Note that if the termination network decides to terminate, it merely indicates that the current tracker in use does not work well; it does not necessarily mean that the other tracker can perform better
  • In this case, the switch network still selects a new tracker from the two candidate trackers instead of blindly switching to the tracker currently not in use (a sketch of this decision logic follows).
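
The behaviour described in the two notes above can be summarised in a small sketch of the decision logic: a termination head decides whether to keep the current tracker, and a switch head re-selects among all candidate trackers whenever termination fires. This is a hedged reconstruction from the description; the feature dimension, network sizes, and 0.5 threshold are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecisionModule(nn.Module):
    """Sketch of the decision logic: termination network + switch network."""

    def __init__(self, feat_dim: int = 512, num_trackers: int = 2):
        super().__init__()
        # Termination head: probability that the current tracker should be abandoned.
        self.termination = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                         nn.Linear(128, 1), nn.Sigmoid())
        # Switch head: scores over ALL candidate trackers (detection, template).
        self.switch = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                    nn.Linear(128, num_trackers))

    def forward(self, feat: torch.Tensor, current: int) -> int:
        # Keep the tracker in use unless the termination head fires.
        if self.termination(feat).item() < 0.5:
            return current
        # On termination, select among all candidates rather than blindly flipping,
        # so the switch may re-pick the tracker already in use.
        return int(self.switch(feat).argmax().item())

decision = DecisionModule()
state = torch.randn(1, 512)              # stand-in for frame/appearance features
tracker_id = decision(state, current=0)  # 0: detection tracker, 1: template tracker
```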
Results
  • The authors conduct comparative evaluations on the benchmarks including OTB-2013 [37], OTB-50 [38], OTB-100 [38], LaSOT [12], TrackingNet [24], UAV123 [23] and VOT2018 [18] with three considerations: 1) they compare the proposed DTNet with state-of-the-art trackers; 2) to demonstrate the effectiveness of the switch module, they compare DTNet with some of its variants employing different trackers; 3) they further compare the method with trackers fused at the feature level to demonstrate the advantage of the decision-based strategy.
  • Apart from the experimental results shown here, please refer to the website mentioned in the abstract for supplementary results, including an online visualization of the decision module of the proposed DTNet and a comparison with state-of-the-art tracking methods.
Conclusion
  • The authors proposed an ensemble framework, namely DTNet, composed of a decision module and a tracker module for visual tracking.
  • By HRL, the decision module enables the detection tracker and the template tracker that form the tracker module to compete with each other, so that DTNet can switch between them for different scenes.
  • Differing from fusion-based methods, DTNet learns an online decision to pick a particular tracker for a particular scene.
Tables
  • Table1: Comparison with the variants of DTNet on different benchmarks
  • Table2: Comparison with the fusion-based trackers in terms of AUC
Related work
  • Detection trackers. Trackers based on object detection in each video frame usually learn a classifier to pick the positive candidate patches sampled around the previous observation. Nam and Han [25] proposed a lightweight CNN that learns generic feature representations through shared convolutional layers to detect the target object. Han et al. [14] selected a random subset of branches for model update to diversify the learned target appearance models. Fan and Ling [13] took self-structural information into account to learn a discriminative appearance model. Song et al. [30] integrated adversarial learning into a tracking-by-detection framework to reduce overfitting on single frames. However, an occasional incorrect detection in a frame is still prone to contaminate and mislead the target appearance model.
Funding
  • We acknowledge the support of the National Key Research and Development Plan of China under Grant 2017YFB1300205, the National Natural Science Foundation of China under Grants 61991411 and U1913204, the Shandong Major Scientific and Technological Innovation Project 2018CXGC1503, the Young Taishan Scholars Program of Shandong Province No. tsqn201909029, and the Qilu Young Scholars Program of Shandong University No. 31400082063101.
Study subjects and analysis
Cases: 3
DIoU denotes the difference between the two trackers' overlap (IoU) values. Three cases are distinguished by the reward setup described above: (1) one tracker succeeds while the other fails; (2) both succeed; (3) both fail. Accordingly, three enlarger coefficients are assigned in descending order, which favors selecting the agent with higher accuracy while guiding the tracking competition.
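
A hedged sketch of this competitive reward is given below. Only the three-case structure and the descending order of the enlarger coefficients come from the text; the specific coefficient values, the 0.5 success threshold, and the function name are assumptions for illustration.

```python
def competition_reward(iou_selected: float, iou_other: float, thr: float = 0.5) -> float:
    """Reward for the currently selected tracker in the competition.

    iou_selected: overlap of the selected tracker's box with the ground truth
    iou_other:    overlap of the competing tracker's box with the ground truth
    """
    d_iou = iou_selected - iou_other          # DIoU: difference between the two overlaps
    selected_ok, other_ok = iou_selected >= thr, iou_other >= thr
    if selected_ok != other_ok:               # case 1: one succeeds while the other fails
        coeff = 3.0                           # assumed value; largest coefficient
    elif selected_ok and other_ok:            # case 2: both succeed
        coeff = 2.0                           # assumed value
    else:                                     # case 3: both fail
        coeff = 1.0                           # assumed value; smallest coefficient
    return coeff * d_iou                      # favours picking the more accurate agent

# Example: the selected tracker succeeds (IoU 0.7) while the other fails (IoU 0.3).
print(competition_reward(0.7, 0.3))  # about 1.2
```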

Datasets: 4
We further compare our DTNet with some trackers based on the fusion strategy. According to the quantitative results listed in Table 2, our method exhibits the best performance among real-time trackers on all four datasets. By associating Table 1 with Table 2, we find that although either FCT or SiamFC alone is outperformed by some state-of-the-art fusion-based trackers such as HSME (on OTB-2013) and MCCT-H (on OTB-2013 and OTB-100), the DTNet that combines them in a switching manner through the decision module performs significantly better than both.
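
For reference, the AUC figures in Table 2 follow the standard OTB-style success metric: the fraction of frames whose overlap with the ground truth exceeds a threshold, averaged over thresholds from 0 to 1. A minimal sketch is given below; it shows the generic metric, not the authors' evaluation code.

```python
import numpy as np

def success_auc(ious: np.ndarray, thresholds: np.ndarray = np.linspace(0, 1, 21)) -> float:
    """Area under the success plot: mean success rate over overlap thresholds."""
    success_rates = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success_rates))

# Example with synthetic per-frame overlaps for one sequence.
print(success_auc(np.array([0.82, 0.61, 0.55, 0.40, 0.05])))
```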

References
  • P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • C. Bailer, A. Pagani, and D. Stricker. A superior tracking approach: Building a strong tracker through fusion. In European Conference on Computer Vision, pages 170–185, 2014.
  • L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1401–1409, 2016.
  • L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865, 2016.
  • B. Chen, D. Wang, P. Li, S. Wang, and H. Lu. Real-time 'actor-critic' tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 318–334, 2018.
  • J. Choi, H. Jin Chang, J. Jeong, Y. Demiris, and J. Young Choi. Visual tracking using attention-modulated disintegration and integration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4321–4330, 2016.
  • M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4660–4669, 2019.
  • M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6638–6646, 2017.
  • M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
  • M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision, pages 4310–4318, 2015.
  • X. Dong and J. Shen. Triplet loss in siamese network for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 459–474, 2018.
  • H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5374–5383, 2019.
  • H. Fan and H. Ling. Sanet: Structure-aware network for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 42–49, 2017.
  • B. Han, J. Sim, and H. Adam. Branchout: Regularization for online ensemble tracking with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3356–3365, 2017.
  • A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4834–4843, 2018.
  • S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In International conference on machine learning, pages 597–606, 2015.
  • O. Khalid, J. C. SanMiguel, and A. Cavallaro. Multi-tracker partition fusion. IEEE Transactions on Circuits and Systems for Video Technology, 27(7):1527–1539, 2016.
  • M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
  • J. Li, C. Deng, R. Y. Da Xu, D. Tao, and B. Zhao. Robust object tracking with discrete graph-based multiple experts. IEEE Transactions on Image Processing, 26(6):2736–2750, 2017.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for uav tracking. In European conference on computer vision, pages 445–461, 2016.
  • M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018.
  • H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4293–4302, 2016.
  • Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4303–4311, 2016.
  • E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. Youtube-boundingboxes: A large highprecision human-annotated data set for object detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5296–5305, 2017.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • K. Song, W. Zhang, W. Lu, Z.-J. Zha, X. Ji, and Y. Li. Visual object tracking via guessing and matching. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8990–8999, 2018.
  • R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1420–1429, 2016.
  • Z. Tian, C. Shen, H. Chen, and T. He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9627–9636, 2019.
  • J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2805–2813, 2017.
  • N. Wang and D.-Y. Yeung. Ensemble-based tracking: Aggregating crowdsourced structured time series data. In International Conference on Machine Learning, pages 1107–1115, 2014.
  • N. Wang, W. Zhou, Q. Tian, R. Hong, M. Wang, and H. Li. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4844–4853, 2018.
  • Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4854–4863, 2018.
  • Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2411–2418, 2013.
  • Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Young Choi. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2711–2720, 2017.
  • J. Zhang, S. Ma, and S. Sclaroff. Meem: robust tracking via multiple experts using entropy minimization. In European conference on computer vision, pages 188–203, 2014.
Authors
Ke Song
Ran Song