Don't Even Look Once: Synthesizing Features for Zero-Shot Detection

CVPR, pp. 11690-11699, 2019.


Abstract:

Zero-shot detection, namely, localizing both seen and unseen objects, increasingly gains importance for large-scale applications with a large number of object classes, since collecting sufficient annotated data with ground-truth bounding boxes is simply not scalable. While vanilla deep neural networks deliver high performance for object…

Introduction
  • While deep learning based object detection methods have achieved impressive average precision over the last five years [13, 35, 32, 27, 14, 33], these gains can be attributed to the availability of training data in the form of fully annotated ground-truth object bounding boxes.
  • As the authors scale up detection to large-scale applications and “in the wild” scenarios, collecting bounding-box-level annotations across a large number of object classes is not scalable.
  • To understand the root of this issue, the authors note that most detectors base their detection on three components: (a) proposing object bounding boxes; (b) outputting an objectness score that provides confidence for a candidate bounding box and filters out low-confidence boxes; and (c) a classification score for recognizing the object in each high-confidence bounding box (see the sketch after this list).
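A minimal sketch of this three-stage pipeline, with all function names as hypothetical placeholders rather than the authors' code; it illustrates how candidate boxes on unseen objects are discarded once their objectness confidence falls below the threshold:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates

def detect(image,
           propose_boxes: Callable[[object], List[Box]],
           objectness: Callable[[object, Box], float],
           classify: Callable[[object, Box], Tuple[str, float]],
           conf_threshold: float = 0.5) -> List[Tuple[Box, str, float]]:
    """Generic three-stage detector: propose -> score confidence -> classify."""
    detections = []
    for box in propose_boxes(image):              # (a) candidate bounding boxes
        conf = objectness(image, box)             # (b) objectness confidence
        if conf < conf_threshold:
            # Boxes covering unseen classes tend to score low here and are
            # filtered out, so a vanilla detector never reports them.
            continue
        label, cls_score = classify(image, box)   # (c) recognition in the box
        detections.append((box, label, conf * cls_score))
    return detections
```

In the zero-shot setting this filtering step is exactly where unseen objects are lost; the paper's key insight is that such boxes are in fact proposed but rejected for poor confidence, which is what the synthesized unseen-class features described later are meant to fix.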
Highlights
  • While deep learning based object detection methods have achieved impressive average precision over the last five years [13, 35, 32, 27, 14, 33], these gains can be attributed to the availability of training data in the form of fully annotated ground-truth object bounding boxes
  • The goal of Test-Seen is to benchmark the performance of the proposed method against vanilla detectors, which are optimal for this task
  • As such, mAP performance gains could be credited to improvements in placing high-confidence bounding boxes in the right places as well as to improvements in the ZSR algorithm
  • We proposed Don't Even Look Once (DELO), a novel zero-shot detection algorithm for localizing seen and unseen objects
  • We focus on the generalized Zero-Shot Detection problem where both seen and unseen objects can be present at test-time, but we are only provided examples of seen objects during training
  • Our key insight is that, while vanilla DNN detectors are capable of producing bounding boxes on unseen objects, these get filtered out due to poor confidence
Methods
  • Evaluated configurations: YOLOv2, ZS-YOLO, and DELO on the Pascal VOC and MS COCO splits, each under Test-Unseen (TU), Test-Seen (TS), and Test-Mix (TM)
  • Pascal VOC has only 20 classes
  • For this reason, the goal here is primarily to understand how performance varies with different seen/unseen split ratios (5/15, 10/10, and 15/5).
  • On the 10/10 and 15/5 splits, the authors train for 60 epochs and scale the learning rate by 0.5 every 15 epochs.
  • On the 5/15 split, training runs for 200 epochs and the learning rate is scaled by 0.5 every 60 epochs (a minimal schedule sketch follows this list).
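A minimal sketch of the reported step schedule for the 10/10 and 15/5 splits, assuming a PyTorch-style training loop; the model and optimizer below are placeholders, not the authors' configuration:

```python
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(1024, 20)                        # placeholder network, not DELO
optimizer = optim.SGD(model.parameters(), lr=1e-3) # placeholder optimizer settings

# 10/10 and 15/5 splits: 60 training epochs, learning rate halved every 15 epochs.
# (For the 5/15 split the text reports 200 epochs with halving every 60 epochs.)
scheduler = StepLR(optimizer, step_size=15, gamma=0.5)

for epoch in range(60):
    # ... one training epoch over the seen-class data would run here ...
    optimizer.step()        # placeholder for the per-batch parameter updates
    scheduler.step()        # apply the 0.5 decay at each 15-epoch boundary
```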
Results
  • ZSD algorithms must be evaluated carefully to properly attribute gains to the different system components
  • For this reason, the authors list four principal attributes that are essential for validating performance in this context, the first being (a) Dataset Complexity.
  • At high seen/unseen class ratios, gains are predominantly a function of the recognition algorithm and require no improvement in the bounding boxes placed on unseen objects
  • For this reason, the authors experiment with a number of different splits.
Conclusion
  • The Recall@100 metric helps in this process since 100 bounding boxes typically contain all unseen objects at the high split ratios
  • Once this is guaranteed, background boxes can be eliminated based on post-processing with a zero-shot classifier that rejects background whenever no unseen class is deemed favorable.
  • The authors' key insight is that, while vanilla DNN detectors are capable of producing bounding boxes on unseen objects, these get filtered out due to poor confidence
  • To address this issue, DELO synthesizes unseen-class visual features by leveraging semantic data (a schematic sketch follows this list).
  • The authors' results show that, on a number of metrics and on complex datasets involving multiple objects per image, DELO achieves state-of-the-art performance
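A schematic sketch of conditional feature synthesis from class semantics, in the spirit of the generative ZSL methods cited in Related Work; the two-layer generator and all dimensions are illustrative assumptions, not DELO's published architecture:

```python
import torch
from torch import nn

class ConditionalFeatureGenerator(nn.Module):
    """Illustrative generator: class semantics + noise -> synthetic visual feature.

    Layer sizes and dimensions are assumptions for this sketch, not DELO's
    published architecture.
    """
    def __init__(self, sem_dim=300, noise_dim=100, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, feat_dim),
        )

    def forward(self, semantics, noise):
        # Condition on the semantic vector; the noise provides sample diversity.
        return self.net(torch.cat([semantics, noise], dim=1))

# Synthesize a batch of features for one unseen class from its semantic embedding.
gen = ConditionalFeatureGenerator()
sem = torch.randn(64, 300)            # stand-in for 64 copies of a word/attribute vector
noise = torch.randn(64, 100)
synthetic_features = gen(sem, noise)  # shape (64, 1024), usable as training examples
```

The synthetic features stand in for the missing unseen-class examples, so confidence prediction and zero-shot classification can be trained without any real unseen-object images.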
Tables
  • Table1: Zero-shot detection evaluation results on various datasets and seen/unseen splits. TU = Test-Unseen, TS = Test-Seen, TM = Test-Mix represent the different data configurations. Overall average precision (AP) in % is reported. The highest AP for every setting is in bold
  • Table2: Evaluation on the 10/10 split of Pascal VOC for baseline models. TU = Test-Unseen, TS = Test-Seen, TM = Test-Mix. Overall average precision in % is reported. The difference from the original YOLOv2 is reported in parentheses, and the largest difference is in bold
  • Table3: ZSD and GZSD performance evaluated with Recall@100 and mAP on MS COCO to compare with other ZSD methods. A 2FC classifier trained on Dsyn is appended to YOLOv2 and DELO to conduct the full detection. The number in parentheses is the class-agnostic recall, ignoring classification
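The class-agnostic Recall@100 reported in Table3 can be computed per image roughly as sketched below; the (x1, y1, x2, y2) box format and the IoU threshold of 0.5 are assumptions of this sketch rather than details stated in the excerpt:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def recall_at_k(scored_boxes, gt_boxes, k=100, iou_thr=0.5):
    """Class-agnostic Recall@k for one image.

    scored_boxes: list of (box, confidence); gt_boxes: list of boxes.
    Returns the fraction of ground-truth boxes overlapped (IoU >= iou_thr)
    by any of the top-k highest-confidence predictions.
    """
    top_k = [b for b, _ in sorted(scored_boxes, key=lambda p: -p[1])[:k]]
    hits = sum(any(iou(gt, p) >= iou_thr for p in top_k) for gt in gt_boxes)
    return hits / max(len(gt_boxes), 1)
```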
Related work
  • Traditional vs. Generalized ZSL (GZSL). Zero-shot learning (ZSL) seeks to recognize novel visual categories that are unannotated in training data [21, 40, 22, 45]. As such, ZSL exhibits a bias towards unseen classes, and GZSL evaluation attempts to rectify this by evaluating on both seen and unseen objects at test time [3, 43, 12, 17, 15, 46, 37, 47]. Our evaluation for ZSD follows GZSL, focusing on both seen and unseen objects.
  • Generative ZSL methods. Semantic information is a key ingredient for transferring knowledge from seen to unseen classes. It can take the form of attributes [10, 21, 28, 29, 2, 5], word phrases [38, 11], etc. Such semantic data is often easier to collect, and the premise of many ZSL methods is to substitute hard-to-collect visual samples with semantic data. Nevertheless, there is often a large visual-semantic gap, which results in significant performance degradation. Motivated by these concerns, recent works have proposed to synthesize unseen examples by means of generative models such as autoencoders [4, 19], GANs, and adversarial methods [49, 20, 42, 16, 23, 36] that take semantic vectors as input and output images. Following their approach, we propose to similarly bridge the visual-semantic gap in ZSD by synthesizing visual features for unseen objects (since visual images are somewhat noisy).
  • Zero-Shot Detection. Recently, a few papers have begun to focus attention on zero-shot detection [1, 31, 24, 30, 48]. Unfortunately, methods, datasets, protocols, and splits are all somewhat different, and software code is not publicly available to comprehensively validate against all the methods. Nevertheless, we highlight here some of the differences within the context of our evaluation metrics (a-d).
Funding
  • This work was supported partly by the National Science Foundation Grant 1527618 and the Office of Naval Research Grant N0014-18-1-2257, and by a gift from ARM Corporation
References
  • Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 384–400, 2018. 2, 3, 5, 6, 7
  • Gregory Castanon, Mohamed Elgharib, Venkatesh Saligrama, and Pierre-Marc Jodoin. Retrieval in longsurveillance videos using user-described motion and object attributes. IEEE Transactions on Circuits and Systems for Video Technology, 26(12):2313–2327, 2016. 2
  • Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild, pages 52–68. Springer International Publishing, Cham, 2016. 2
  • Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2018. 2
  • Yuting Chen, Joseph Wang, Yannan Bai, Gregory Castanon, and Venkatesh Saligrama. Probabilistic semantic retrieval for surveillance videos with activity graphs. IEEE Transactions on Multimedia, 2018. 2
  • Berkan Demirel, Ramazan Gokberk Cinbis, and Nazli Ikizler-Cinbis. Zero-shot object detection by hybrid region embedding. In BMVC, 2018. 3, 6
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. pages 248–255. IEEE, 2009. 1, 2
  • Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. 88(2):303–338, 2010. 1, 2
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html. 6
  • Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009. 2, 7
  • Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013. 2
  • Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zero-shot recognition. arXiv preprint arXiv:1710.04837, 2017. 2
  • Ross Girshick. Fast r-cnn. pages 1440–1448, 2015. 1
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. 2017. 1
  • He Huang, Changhu Wang, Philip S. Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2
  • He Huang, Changhu Wang, Philip S Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 801–810, 2019. 2
  • Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Transferable contrastive network for generalized zero-shot learning. In The IEEE International Conference on Computer Vision (ICCV), October 2019. 2
  • Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 4
  • Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345, 2017. 2
  • Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2
  • Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014. 2
  • Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255, 2015. 2
  • Jingjing Li, Mengmeng Jing, Ke Lu, Zhengming Ding, Lei Zhu, and Zi Huang. Leveraging the invariant side of generative zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7402–7411, 2019. 2
  • Zhihui Li, Lina Yao, Xiaoqin Zhang, Xianzhi Wang, Salil Kanhere, and Huaxiang Zhang. Zero-shot object detection with textual descriptions. In Proceedings of AAAI, 2019. 2, 5, 7
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 2
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 1, 2, 6
  • Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. pages 21–37. Springer, 2016. 1
  • Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. Computer Vision–ECCV 2012, pages 488–501, 2012. 2
  • Devi Parikh and Kristen Grauman. Interactively building a discriminative vocabulary of nameable attributes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1681–1688. IEEE, 2011. 2
  • Shafin Rahman, Salman Khan, and Nick Barnes. Transductive learning for zero-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6082–6091, 2019. 2, 3, 5, 6
  • Shafin Rahman, Salman Khan, and Fatih Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. In Asian Conference on Computer Vision, pages 547–563. Springer, 2018. 2, 3, 5, 6
  • Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. pages 779–788, 2016. 1, 3
  • Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016. 1
  • Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. 3, 4, 5, 7
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 1
  • Mert Bulent Sariyildiz and Ramazan Gokberk Cinbis. Gradient matching generative networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2168–2178, 2019. 2
  • Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero- and few-shot learning via aligned variational autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2
  • Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pages 935–943, 2013. 2
  • Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015. 4
  • Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 2
  • Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence, 2018. 2, 8
  • Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2
  • Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and the ugly. In 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2
  • Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. 2
  • Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 2
  • Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Generalized zero-shot recognition based on visually semantic embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2
  • Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Learning classifiers for target domain with limited or no labels. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7643–7653, Long Beach, California, USA, 09–15 Jun 2019. PMLR. 2
  • Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Zero shot detection. IEEE Transactions on Circuits and Systems for Video Technology, 2019. 2, 3, 5, 7
  • Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2