Relation Networks for Object Detection

Computer Vision and Pattern Recognition (CVPR), 2018.

DOI: https://doi.org/10.1109/cvpr.2018.00378

Abstract:

Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance feature and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown effective on improving object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN based detection. It gives rise to the first fully end-to-end object detector.

Introduction
  • Recent years have witnessed significant progress in object detection using deep convolutional neural networks (CNNs) [27].
  • A heuristic and hand-crafted post-processing step, non-maximum suppression (NMS), is applied to remove duplicate detections (a minimal sketch of this procedure follows this list)
  • It has been well recognized in the vision community for years that contextual information, or relations between objects, helps object recognition [12, 17, 46, 47, 39, 36, 16, 6].
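For concreteness, the sketch below shows the standard greedy NMS procedure that this post-processing step performs. It is the textbook algorithm, not code from the paper, and the IoU threshold of 0.5 is a common default rather than a value taken from the experiments.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and discard every remaining box whose IoU with it exceeds the
    threshold. boxes is (N, 4) as (x1, y1, x2, y2); scores is (N,)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # candidate indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap of the kept box with all remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress duplicates
    return keep
```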
Highlights
  • Recent years have witnessed significant progress in object detection using deep convolutional neural networks (CNNs) [27]
  • Given a sparse set of region proposals, object classification and bounding box regression are performed on each proposal individually
  • For the first time, we propose an adapted attention module for object detection (a simplified sketch follows this list)
  • The comprehensive ablation experiments suggest that the relation modules have learned information about relations between objects that is missing when learning is performed on individual objects
  • It is not clear what is learnt in the relation module, especially when multiple ones are stacked
  • We investigate the relation module in the {r1, r2} = {1, 0} head in Table 1(c)
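The adapted attention module above is, in essence, scaled dot-product attention [49] whose weights are modulated by a geometric term computed from pairs of bounding boxes. The NumPy sketch below is a minimal single-head forward pass under stated simplifications: the projection matrices are random stand-ins for learned parameters, the sinusoidal embedding of the 4-d geometric feature is omitted, and the multi-head concatenation and residual addition used in the paper are left out.

```python
import numpy as np

def relation_head(appearance, boxes, d_k=64, seed=0):
    """One relation head (simplified). appearance: (N, d_f) per-proposal
    features; boxes: (N, 4) as (cx, cy, w, h). Returns an (N, d_f) relation
    feature: an attention-weighted sum of projected features, where the
    weight mixes appearance similarity and pairwise box geometry."""
    rng = np.random.default_rng(seed)
    N, d_f = appearance.shape
    # Random stand-ins for the learned projections W_Q, W_K, W_V, W_G.
    Wq = rng.normal(scale=d_f ** -0.5, size=(d_f, d_k))
    Wk = rng.normal(scale=d_f ** -0.5, size=(d_f, d_k))
    Wv = rng.normal(scale=d_f ** -0.5, size=(d_f, d_f))
    Wg = rng.normal(size=4)

    # Appearance weight: scaled dot-product attention logits.
    logits = (appearance @ Wq) @ (appearance @ Wk).T / np.sqrt(d_k)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability

    # Geometric weight: translation- and scale-invariant log-space box
    # deltas, projected to a scalar and clipped at zero (ReLU).
    cx, cy, w, h = boxes.T
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]), 1e-3) / w[:, None])
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]), 1e-3) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    w_geo = np.maximum(np.stack([dx, dy, dw, dh], axis=-1) @ Wg, 0)

    # Combined weight: geometry modulates the appearance softmax.
    weight = w_geo * np.exp(logits)
    weight /= weight.sum(axis=1, keepdims=True) + 1e-6
    return weight @ (appearance @ Wv)
```

Because the geometric term multiplies the exponentiated appearance logits inside the normalization, a pair of boxes with implausible relative geometry contributes little regardless of appearance similarity; this is what distinguishes the module from plain self-attention.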
Methods
  • All experiments are performed on the COCO detection dataset with 80 object categories [34].
  • A union of the 80k train images and a 35k subset of val images is used for training [2, 32].
  • Most ablation experiments report detection accuracies on a held-out subset of 5k val images (minival), as is common practice [2, 32].
  • Table 5 reports accuracies on test-dev for system-level comparison.
Results
  • The computation overhead is relatively small compared to the complexity of the whole detection network, as shown in Table 5.
Conclusion
  • The comprehensive ablation experiments suggest that the relation modules have learned information about relations between objects that is missing when learning is performed on individual objects.
  • It is not clear what is learnt in the relation module, especially when multiple ones are stacked.
  • The right example suggests that the person box contributes to the detection of the glove
  • While these examples are intuitive, our understanding of how the relation module works is preliminary and left as future work
Tables
  • Table 1: Ablation study of relation module structure and parameters (* marks the default). mAP@all is reported
  • Table 2: Comparison of various heads with similar complexity
  • Table 3: Ablation study of input features for the duplicate removal network (none indicates the feature is absent)
  • Table 4: Comparison of NMS methods and our approach (Section 4.3). The last row uses end-to-end training (Section 4.4); the SoftNMS score-decay rule is sketched after this list
  • Table 5: Improvement (2fc head+SoftNMS [4], 2fc+RM head+SoftNMS, and 2fc+RM head+e2e, from left to right, connected by →) in state-of-the-art systems on COCO minival and test-dev. Online hard example mining (OHEM) [40] is adopted. Note that the strong SoftNMS method (σ = 0.6) is used for duplicate removal in the non-e2e approaches
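For reference, SoftNMS [4] differs from classic NMS in one line: rather than deleting a box that overlaps the current top detection, its score is decayed as a function of the overlap. Below is a minimal sketch of the Gaussian decay rule (one suppression step; the full method embeds this in the greedy loop sketched earlier), with σ = 0.6 matching the setting used in Table 5.

```python
import numpy as np

def soft_nms_decay(scores, ious, sigma=0.6):
    """Gaussian SoftNMS update: each remaining box's score is multiplied by
    exp(-IoU^2 / sigma), so heavy overlaps are strongly down-weighted but
    never hard-deleted."""
    return scores * np.exp(-(ious ** 2) / sigma)
```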
Related work
  • Object relation in post-processing. Most early works use object relations as a post-processing step [12, 17, 46, 47, 36]. The detected objects are re-scored by considering object relationships. For example, co-occurrence, which indicates how likely two object classes are to appear in the same image, is used by DPM [15] to refine object scores. Subsequent approaches [7, 36] try more complex relation models that take additional position and size cues [3] into account. We refer readers to [16] for a more detailed survey. These methods achieved moderate success in the pre-deep-learning era but have not proven effective with deep ConvNets, possibly because deep ConvNets already incorporate contextual information implicitly through their large receptive fields.
References
  [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
  [2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, pages 2874–2883, 2016.
  [3] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177, 1982.
  [4] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.
  [5] D. Britz, A. Goldie, T. Luong, and Q. Le. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906, 2017.
  [6] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In ICCV, 2017.
  [7] M. J. Choi, A. Torralba, and A. S. Willsky. A tree-based context model for object recognition. TPAMI, 34(2):240–252, Feb 2012.
  [8] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
  [9] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  [10] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
  [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  [12] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In CVPR, 2009.
  [13] Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.
  [14] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 2010.
  [15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
  [16] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. CVIU, 2010.
  [17] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008.
  [18] R. Girshick. Fast R-CNN. In ICCV, 2015.
  [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  [20] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In ICCV, pages 1080–1088, 2015.
  [21] G. Gkioxari, R. B. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. CoRR, abs/1704.07333, 2017.
  [22] S. Gupta, B. Hariharan, and J. Malik. Exploring person context and local scene context for object detection. CoRR, abs/1511.08177, 2015.
  [23] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
  [24] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  [25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  [26] J. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In CVPR, 2017.
  [27] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
  [28] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
  [29] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. arXiv preprint arXiv:1603.07415, 2016.
  [30] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
  [31] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
  [32] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  [33] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
  [34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  [35] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
  [36] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
  [37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  [38] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  [39] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.
  [40] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  [41] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
  [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  [43] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
  [44] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
  [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  [46] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In ICCV, 2003.
  [47] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
  [48] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  [50] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  [51] B. Yao and L. Fei-Fei. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. TPAMI, 34(9):1691–1703, Sept 2012.
  [52] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
  [53] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.