
TransTrack: Multiple-Object Tracking with Transformer


Abstract

Multiple-object tracking (MOT) is mostly dominated by complex, multi-step tracking-by-detection algorithms, which perform object detection, feature extraction, and temporal association separately. The query-key mechanism in single-object tracking (SOT), which tracks the object in the current frame using the object feature from the previous frame, h...

Introduction
  • Video-based scene understanding and human behavior analysis are essential for current computer vision systems to understand the world at a high level.
  • [Figure: (a) the complex tracking-by-detection MOT pipeline; (b) an SOT-style query-key pipeline, which will miss new-coming objects; (c) the query-key pipeline, which has great potential to set up a simple MOT method.]
  • The dominant MOT method is the complex multi-step tracking-by-detection pipeline.
  • The query-key mechanism in the SOT pipeline has the potential to set up a much simpler MOT pipeline, but on its own it will miss new-coming objects.
  • TransTrack aims to take advantage of the query-key mechanism while still detecting new-coming objects.
Highlights
  • Video-based scene understanding and human behavior analysis are essential for current computer vision systems to understand the world at a high level
  • We introduce an online joint-detection-and-tracking Multiple-Object Tracking (MOT) pipeline based on the query-key mechanism, simplifying the complex, multi-step components of previous methods
  • The learned object query detects objects in the current frame and object feature query from the previous frame associates objects in the current frame with the previous ones
  • Our method achieves competitive 65.8% MOTA on the MOT17 challenge dataset
  • Query-key mechanism is widely used in the field of Single-Object Tracking (SOT), but is seldom studied in MOT
  • We believe that it is a promising direction to further improve the overall performance of TransTrack
  • We demonstrate that the query-key mechanism can serve as an effective and strong baseline for MOT
Methods
  • Following [5], the original Transformer architecture is built on the feature map of the res5 stage of ResNet [12].
  • To increase the feature resolution, the authors apply dilated convolution to the res5 stage and remove the stride from the first convolution of this stage, a variant called Transformer-DC5 [5].
  • This design yields an obvious 3.6 MOTA improvement.
  • For the whole video sequence, most of the objects will be missed and the FN metric deteriorates, as shown in the second row of Table 3
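The resolution effect of the DC5 modification can be illustrated with standard convolution output-size arithmetic. This is only a sketch: the 50-pixel input and 3x3 kernel below are illustrative assumptions, not values from the paper.

```python
def conv_out(size, kernel, stride, padding, dilation=1):
    """Standard convolution output-size formula with dilation."""
    effective_k = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_k) // stride + 1

# Plain res5 entry conv: stride 2 halves the feature map.
plain = conv_out(50, kernel=3, stride=2, padding=1)   # 25

# DC5 variant: stride removed, dilation 2 keeps the receptive
# field comparable while preserving spatial resolution.
dc5 = conv_out(50, kernel=3, stride=1, padding=2, dilation=2)  # 50

print(plain, dc5)  # 25 50
```

The doubled feature resolution is what lets the Transformer-DC5 variant localize small objects better, at the cost of more attention computation.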
Results
  • The authors show that the method achieves performance comparable with state-of-the-art models, without bells and whistles.
  • The authors believe that it is a promising direction to further improve the overall performance of TransTrack.
  • The authors' method achieves competitive 65.8% MOTA on the MOT17 challenge dataset
Conclusion
  • The authors set up a simple joint-detection-and-tracking MOT pipeline, TransTrack, based on the query-key mechanism.
  • The image feature maps are common keys among queries.
  • The learned object query detects objects in the current frame and object feature query from the previous frame associates objects in the current frame with the previous ones.
  • Query-key mechanism is widely used in the field of SOT, but is seldom studied in MOT.
  • The authors demonstrate that the query-key mechanism can serve as an effective and strong baseline for MOT
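The association step described above can be sketched in simplified form: boxes decoded from the learned object queries (detections) are matched by box IoU to boxes decoded from the previous frame's object feature queries (tracks), and unmatched detections start new tracks. The paper matches with the Kuhn-Munkres (Hungarian) algorithm; the greedy matching below is a simplified stand-in, and all names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(track_boxes, det_boxes, thresh=0.5):
    """Greedily match detections (from learned object queries) to
    tracks (from object feature queries). Unmatched detections
    become new tracks, which is how new-coming objects enter."""
    matches, used = {}, set()
    for t, tb in enumerate(track_boxes):
        best, best_iou = None, thresh
        for d, db in enumerate(det_boxes):
            if d in used:
                continue
            score = iou(tb, db)
            if score > best_iou:
                best, best_iou = d, score
        if best is not None:
            matches[t] = best
            used.add(best)
    new_tracks = [d for d in range(len(det_boxes)) if d not in used]
    return matches, new_tracks

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(associate(tracks, dets))  # ({0: 0}, [1])
```

Here the first detection continues track 0, while the second detection overlaps no existing track and is promoted to a new track, illustrating why adding the learned object query fixes the missed-new-object failure of a pure SOT-style pipeline.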
Tables
  • Table1: Ablation study on external training data. 1st row is the model trained only on CrowdHuman dataset. 2nd row is the model trained only on split training set of MOT dataset. 3rd row is the model trained on CrowdHuman dataset first and then on split training set of MOT dataset. All models are tested on split validation set of MOT dataset
  • Table2: Ablation study on Transformer architecture. Original transformer suffers from low feature resolution. Deformable DETR with multi-scale feature input achieves best performance
  • Table3: Ablation study on input query. Only the learned object query obtains limited association performance. Only the object feature query from the previous frame leads to numerous FN, since it misses new-coming objects. Using both achieves the best detection and tracking performance
  • Table4: Evaluation on MOT17 test sets. We list published results of both public and private detection and compare TransTrack with methods of private detection. TransTrack delivers excellent performance in terms of MOTP and FN, proving the success of introducing the learned object query into the pipeline. The IDs of TransTrack is comparable with ChainedTracker, which shows the effectiveness of the object feature query in associating two adjacent frames
Related work
  • As our work introduces the query-key mechanism into the multi-object tracking model, we first review applications of the query-key mechanism in object detection and single-object tracking, then dive into related work on multi-object tracking.

    Query-Key mechanism in Object Detection. The query-key mechanism has been successfully applied in object detection via its entities of self-attention and cross-attention [36], e.g., Relation Network [14], DETR [5], and Deformable DETR [48]. Among them, DETR reasons about the relations of the object queries and the global image context to directly output the final set of predictions in parallel. DETR streamlines the detection pipeline, effectively removing the need for non-maximum suppression and anchor generation.
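The query-key interaction underlying these detectors is scaled dot-product attention [36]: each query attends over keys derived from image features and gathers a weighted mix of the corresponding values. A minimal pure-Python sketch, with illustrative 2-dimensional toy vectors:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V,
    computed per query over all keys."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted mix of value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# One query attending to two keys: the output leans toward the
# value whose key is most similar to the query.
out = attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]])
```

In DETR-style detectors, the queries are learned object embeddings and the keys/values come from the encoded image feature map; each output embedding is then decoded into a box and class.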

    We notice that these object detection frameworks can be intuitively applied to multiple-object tracking pipeline to provide object detection.
Reference
  • Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without bells and whistles. In ICCV, pages 941–951, 2019.
  • Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
  • Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking, 2016.
  • Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In ICIP, pages 3464–3468, 2016.
  • Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • Jiahui Chen, Hao Sheng, Yang Zhang, and Zhang Xiong. Enhancing detection model for multiple hypothesis tracking. In CVPRW, pages 18–27, 2017.
  • Long Chen, Haizhou Ai, Zijie Zhuang, and Chong Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect, 2018.
  • Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, pages 244–253, 2019.
  • Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Roberto Henschel, Laura Leal-Taixe, Daniel Cremers, and Bodo Rosenhahn. Fusion of head and full-body detectors for multi-object tracking. In CVPRW, pages 1428–1437, 2018.
  • Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, Oct 2018.
  • Margret Keuper, Siyu Tang, Bjoern Andres, Thomas Brox, and Bernt Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. TPAMI, 42(1):140–153, 2018.
  • Chanho Kim, Fuxin Li, and James M. Rehg. Multi-object tracking with neural gating using bilinear LSTM. In ECCV, pages 200–215, 2018.
  • Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks, 2018.
  • B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In CVPR, pages 8971–8980, 2018.
  • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection, 2018.
  • Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Anton Milan, Laura Leal-Taixe, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, and Cewu Lu. TubeTK: Adopting tubes to track multi-object in a one-step training model. In CVPR, pages 6308–6318, 2020.
  • Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. arXiv preprint arXiv:2007.14557, 2020.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks, 2016.
  • Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
  • Chaobing Shan, Chunbo Wei, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Xiaoliang Cheng, and Kewei Liang. FGAGT: Flow-guided adaptive graph tracking. arXiv preprint arXiv:2010.09015, 2020.
  • Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
  • Peize Sun, Yi Jiang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. OneNet: Towards end-to-end one-stage object detection. arXiv preprint arXiv:2012.05780, 2020.
  • Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, and Ping Luo. Sparse R-CNN: End-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450, 2020.
  • Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In CVPR, July 2017.
  • Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking, 2016.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. arXiv preprint arXiv:2012.03544, 2020.
  • Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H. S. Torr. Fast online object tracking and segmentation: A unifying approach, 2019.
  • Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang. Towards real-time multi-object tracking. arXiv preprint arXiv:1909.12605, 2019.
  • Greg Welch, Gary Bishop, et al. An introduction to the Kalman filter, 1995.
  • Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP, pages 3645–3649. IEEE, 2017.
  • Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. POI: Multiple object tracking with high performance detection and appearance feature. In ECCV, pages 36–, Springer, 2016.
  • Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. A simple baseline for multi-object tracking. arXiv preprint arXiv:2004.01888, 2020.
  • Zheng Zhang, Dazhi Cheng, Xizhou Zhu, Stephen Lin, and Jifeng Dai. Integrated object detection and tracking with tracklet-conditioned detection, 2018.
  • Xingyi Zhou, Vladlen Koltun, and Philipp Krahenbuhl. Tracking objects as points, 2020.
  • Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points, 2019.
  • Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In ECCV, pages 366–382, 2018.
  • Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
  • Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, Oct 2017.
  • Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking, 2018.
Author
Peize Sun
Yi Jiang
Rufeng Zhang
Enze Xie
Tao Kong