GradNet: Gradient-Guided Network for Visual Object Tracking

In ICCV, pp. 6162-6171, 2019.


Abstract:

The fully-convolutional siamese network based on template matching has shown great potential in visual tracking. During testing, the template is fixed to the initial target feature, and the performance relies entirely on the general matching ability of the siamese network. However, this manner cannot capture the temporal variations of ...

Introduction
  • Visual object tracking is an important topic in computer vision, where the target object is identified in the initial video frame and successively tracked in subsequent frames.
  • The first group of trackers [36, 28, 32, 4] improves the discriminative ability of deep networks through frequent online updates.
  • They utilize the first frame to initialize the model and update it every few frames.
  • The speed of these trackers generally cannot meet real-time requirements
Highlights
  • Visual object tracking is an important topic in computer vision, where the target object is identified in the initial video frame and successively tracked in subsequent frames
  • Different trackers are ranked based on the expected average overlap (EAO) criterion
  • We propose a GradNet for template update, achieving accurate tracking at high speed (a minimal sketch of the update idea follows this list)
  • To make full use of gradients and obtain versatile templates, a template generalization method is applied during offline training, which forces the update branch to concentrate on the gradient and avoids overfitting
  • Experiments on four benchmarks show that our method significantly improves the tracking performance compared with other real-time trackers
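As referenced above, the following is a minimal sketch of the gradient-guided update idea in PyTorch (the framework the paper states it uses). The update_branch module, the tensor shapes, and the logistic-style loss are illustrative assumptions, not the authors' released code:

    import torch
    import torch.nn.functional as F

    def xcorr(template, search_feat):
        # SiameseFC-style matching: cross-correlate the template feature
        # with the search-region feature to produce the score map S.
        return F.conv2d(search_feat, template)

    def gradient_guided_update(template, search_feat, label, update_branch):
        # One feed-forward and one backward pass: the gradient of the
        # matching loss w.r.t. the template carries discriminative
        # information, and a learned update branch (assumed interface:
        # gradient in, template increment out) replaces many SGD steps.
        template = template.detach().requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits(xcorr(template, search_feat), label)
        grad = torch.autograd.grad(loss, template, create_graph=True)[0]
        return template + update_branch(grad)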
Methods
  • The authors' tracker is implemented in Python with the PyTorch framework and runs at 80 fps on an Intel i7 3.2 GHz CPU with 32 GB memory and an NVIDIA GTX 1080 Ti GPU with 11 GB memory.
  • The authors compare the tracker with many state-of-the-art real-time trackers on recent benchmarks, including OTB-2015 [38], TC-128 [25], VOT-2017 [21] and LaSOT [12]
Results
  • Evaluation on the OTB-2015 dataset: The OTB-2015 [38] dataset is one of the most popular benchmarks; it consists of 100 challenging video clips annotated with 11 different attributes.
  • The authors refer the reader to [38] for more detailed information
  • The authors adopt both success and precision plots to evaluate different trackers on OTB-2015.
  • The TC-128 [25] dataset consists of 128 fully-annotated image sequences with 11 challenging factors; it is larger than OTB-2015 and focuses more on color information
  • The authors adopt both success and precision plots to evaluate different trackers (both metrics are sketched after this list).
  • Although MDNet and VITAL achieve better accuracy than this tracker, their speeds are far from the real-time requirement (MDNet: 1 fps; VITAL: 1.5 fps)
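As noted in the list above, here is a short sketch of the two OTB-style metrics, assuming (x, y, w, h) box arrays; it illustrates the standard definitions, not the benchmark's official toolkit:

    import numpy as np

    def iou(a, b):
        # Intersection-over-union for arrays of (x, y, w, h) boxes.
        x1 = np.maximum(a[:, 0], b[:, 0])
        y1 = np.maximum(a[:, 1], b[:, 1])
        x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
        y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        return inter / (a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter)

    def success_score(pred, gt):
        # Success plot: fraction of frames whose overlap exceeds each
        # threshold in [0, 1]; the reported score is the area under
        # this curve (approximated by the mean over 21 thresholds).
        overlaps = iou(pred, gt)
        return np.mean([(overlaps > t).mean() for t in np.linspace(0, 1, 21)])

    def precision_score(pred, gt, dist=20.0):
        # Precision plot: fraction of frames whose center-location
        # error is within `dist` pixels (20 px is the usual report point).
        err = np.linalg.norm((pred[:, :2] + pred[:, 2:] / 2)
                             - (gt[:, :2] + gt[:, 2:] / 2), axis=1)
        return (err <= dist).mean()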
Conclusion
  • The authors propose a GradNet for template update, achieving accurate tracking with a high speed.
  • The two sub-nets in GradNet exploit the discriminative information in gradients through feed-forward and backward operations and speed up the hand-designed optimization process.
  • To make full use of gradients and obtain versatile templates, a template generalization method is applied during offline training, which forces the update branch to concentrate on the gradient and avoids overfitting.
  • Experiments on four benchmarks show that the method significantly improves the tracking performance compared with other real-time trackers.
Summary
  • Objectives:

    The authors' goal is to let the score map S have the highest value at the target position and lower values at other positions (formalized in the sketch below).
  • The authors' goal is to force the update branch to focus on gradients and avoid overfitting
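One common way to formalize this objective is a SiameseFC-style pixel-wise logistic loss over the score map S (an assumption here; the paper's exact loss may differ):

    L(S) = (1 / |D|) * Σ_{u ∈ D} log(1 + exp(-y[u] · S[u]))

where D is the set of score-map positions and y[u] = +1 at the target position and -1 elsewhere; minimizing L pushes S up at the target and down everywhere else.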
Tables
  • Table1: The number of backward iterations needed to update the template of SiameseFC. ‘LR’ means learning rate; ‘n×’ means n times the basic learning rate; ‘ITERs’ means the number of iterations needed to converge. No learning rate setting converges in a single iteration (this baseline update is sketched after the table list)
  • Table2: The accuracy (A), robustness (R) and expected average overlap (EAO) scores of different trackers on VOT2017
  • Table3: Precision and success scores on OTB-2015 for different variations of our algorithm
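Table 1's baseline amounts to updating the template by plain gradient descent on the matching loss; below is a minimal sketch of that hand-designed procedure (assumed, following the earlier matching sketch, not the released code):

    import torch
    import torch.nn.functional as F

    def sgd_template_update(template, search_feat, label, lr, iters):
        # Repeated gradient-descent steps on the matching loss; Table 1
        # reports how many such steps ('ITERs') each learning rate needs,
        # and that no single step suffices at any learning rate.
        template = template.detach().clone().requires_grad_(True)
        for _ in range(iters):
            score = F.conv2d(search_feat, template)
            loss = F.binary_cross_entropy_with_logits(score, label)
            grad, = torch.autograd.grad(loss, template)
            with torch.no_grad():
                template -= lr * grad  # basic SGD step with learning rate lr
        return template.detach()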
Related work
  • 2.1. Siamese Network based Tracking

SiameseFC [3] is the most representative tracker based on template matching. Bertinetto et al. [3] present a siamese network with two shared branches to extract features of both the target and the search region. During online tracking, the template is fixed to the initial target feature, and the tracking performance mainly relies on the discriminative ability of the offline-trained network. Without online updating, the tracker achieves beyond real-time speed. Similarly, SINT [33] also designs a network to match the initial target with candidates in a new frame. Its speed is much lower because hundreds of candidate patches are sent into the network instead of one search image. Another siamese-based tracker is GOTURN [17], which proposes a siamese network to regress the target bounding box at a speed of 100 fps. All these methods lack online updating: the fixed model cannot adapt to appearance variations, which makes the tracker easily disturbed by similar instances or background noise. In this paper, we choose SiameseFC as our basic model and propose a gradient-guided method to update the template.
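A minimal sketch of this SiameseFC-style matching (phi stands for any shared convolutional backbone; names and shapes are illustrative assumptions):

    import torch.nn.functional as F

    def siamese_score(phi, target_crop, search_crop):
        z = phi(target_crop)    # template feature, e.g. shape (1, C, 6, 6)
        x = phi(search_crop)    # search feature, e.g. shape (1, C, 22, 22)
        return F.conv2d(x, z)   # score map; peaks mark likely target locations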
Funding
  • Extensive experiments on recent benchmarks demonstrate that our method achieves better performance than other state-of-the-art trackers
  • Our tracker performs significantly better than the baseline model (SiameseFC) by almost 8% in precision and 6% in success
  • Table 2 shows that our tracker achieves the best performance in terms of EAO while maintaining a very competitive accuracy and robustness
  • Experiments on four benchmarks show that our method significantly improves the tracking performance compared with other real-time trackers
Study subjects and analysis
image pairs: 4
Based on these requirements, we propose a template generalization method which adopts search regions from different videos to obtain a versatile template that performs well on all search regions in each training batch. We show the training process of our model without template generalization (a) and with template generalization (b) in Figure 5, based on four image pairs. The main difference is that we utilize one template (instead of four templates) to search for targets in four images from different videos
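A hedged sketch of one template-generalization training step under these assumptions (names are illustrative; the matching, loss, and update branch follow the earlier sketches, not the released code):

    import torch
    import torch.nn.functional as F

    def generalization_step(template, search_feats, labels, update_branch):
        # One template is matched against four search regions from four
        # different videos; since no single target appearance can be
        # memorized, the update branch is forced to rely on the gradient.
        template = template.detach().requires_grad_(True)
        loss = sum(F.binary_cross_entropy_with_logits(F.conv2d(s, template), y)
                   for s, y in zip(search_feats, labels))
        grad = torch.autograd.grad(loss, template, create_graph=True)[0]
        updated = template + update_branch(grad)
        # The single updated template is scored on all four search regions.
        return sum(F.binary_cross_entropy_with_logits(F.conv2d(s, updated), y)
                   for s, y in zip(search_feats, labels))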

Reference
  • [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
  • [2] Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip H. S. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
  • [3] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.
  • [4] Boyu Chen, Peixia Li, Chong Sun, Dong Wang, Gang Yang, and Huchuan Lu. Multi attention module for visual tracking. Pattern Recognition, 87:80–93, 2019.
  • [5] Boyu Chen, Dong Wang, Peixia Li, and Huchuan Lu. Real-time ‘actor-critic’ tracking. In ECCV, 2018.
  • [6] Kenan Dai, Dong Wang, Huchuan Lu, Chong Sun, and Jianhua Li. Visual tracking via adaptive spatially-regularized correlation filters. In ICCV, 2019.
  • [7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In CVPR, 2017.
  • [8] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
  • [9] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
  • [10] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
  • [11] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • [12] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. CoRR, abs/1809.07845, 2018.
  • [13] Heng Fan and Haibin Ling. Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV, 2017.
  • [14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [15] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey. Learning background-aware correlation filters for visual tracking. In ICCV, 2017.
  • [16] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In ICCV, 2017.
  • [17] David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
  • [18] Joao F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [19] Chen Huang, Simon Lucey, and Deva Ramanan. Learning policies for adaptive tracking with deep feature cascades. In ICCV, 2017.
  • [20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [21] Matej Kristan, Ales Leonardis, Jiri Matas, et al. The visual object tracking VOT2017 challenge results. In ICCVW, 2017.
  • [22] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018.
  • [23] Peixia Li, Dong Wang, Lijun Wang, and Huchuan Lu. Deep visual tracking: Review and experimental comparison. Pattern Recognition, 76:323–338, 2018.
  • [24] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. CoRR, abs/1707.09835, 2017.
  • [25] Pengpeng Liang, Erik Blasch, and Haibin Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing, 24(12):5630–5644, 2015.
  • [26] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
  • [27] Chao Ma, Xiaokang Yang, Chongyang Zhang, and Ming-Hsuan Yang. Long-term correlation tracking. In CVPR, 2015.
  • [28] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
  • [29] Eunbyung Park and Alexander C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, 2018.
  • [30] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. CoRR, abs/1807.05960, 2018.
  • [31] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
  • [32] Yibing Song, Chao Ma, Lijun Gong, Jiawei Zhang, Rynson W. H. Lau, and Ming-Hsuan Yang. CREST: Convolutional residual learning for visual tracking. In ICCV, 2017.
  • [33] Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
  • [34] Paul Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
  • [35] Jack Valmadre, Luca Bertinetto, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
  • [36] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
  • [37] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Sequentially training convolutional networks for visual tracking. In CVPR, 2016.
  • [38] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • [39] Bin Yan, Haojie Zhao, Dong Wang, Huchuan Lu, and Xiaoyun Yang. ‘Skimming-perusal’ tracking: A framework for real-time and robust long-term tracking. In ICCV, 2019.
  • [40] Tianyu Yang and Antoni B. Chan. Learning dynamic memory networks for object tracking. In ECCV, 2018.
  • [41] Jianming Zhang, Shugao Ma, and Stan Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
  • [42] Tianzhu Zhang, Si Liu, Changsheng Xu, Bin Liu, and Ming-Hsuan Yang. Correlation particle filter for visual tracking. IEEE Transactions on Image Processing, 27(6):2676–2687, 2018.
  • [43] Tianzhu Zhang, Changsheng Xu, and Ming-Hsuan Yang. Learning multi-task correlation particle filters for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):365–378, 2019.
  • [44] Yunhua Zhang, Lijun Wang, Dong Wang, Mengyang Feng, Huchuan Lu, and Jinqing Qi. Structured siamese network for real-time visual tracking. In ECCV, 2018.
  • [45] Zheng Zhu, Wei Wu, Wei Zou, and Junjie Yan. End-to-end flow correlation tracking with spatial-temporal attention. In ECCV, 2018.