SIRI: Spatial Relation Induced Network For Spatial Description Resolution

NeurIPS 2020


Abstract

Spatial Description Resolution, as a language-guided localization task, is proposed for target location in a panoramic street view, given corresponding language descriptions. Explicitly characterizing object-level relationships while distilling spatial relationships is currently absent but crucial to this task. Mimicking humans, who sequentially traverse spatial relationship words and objects with a first-person view to locate their target, we propose a novel spatial relationship induced network.

Introduction
  • Visual localization tasks aim to locate target positions according to language descriptions; many downstream applications have been developed, such as visual question answering [1, 22, 23], visual grounding [20, 18, 28, 6] and spatial description resolution (SDR) [3].
  • These language-guided localization tasks can be categorized by input format, e.g. perspective images in visual grounding versus panoramic images in the recently introduced SDR.
  • It is worth noting that such crucial issues have not been well addressed in previous work. Panoramic images in SDR, taken from a first-person view, cover far more complex visual details of a street than the perspective, third-person-view images used in visual grounding.
Highlights
  • Visual localization tasks aim to locate target positions according to language descriptions; many downstream applications have been developed, such as visual question answering [1, 22, 23], visual grounding [20, 18, 28, 6] and spatial description resolution (SDR) [3].
  • The challenge of SDR: both visual grounding and spatial description resolution need to explore the correlation between vision and language to locate the target locations.
  • We present a novel spatial relationship induced network for the SDR task.
  • Global positional information is embedded to alleviate the ambiguities caused by the absence of spatial positions throughout the entire image.
  • Since our proposed network fully explores these spatial relationships and is robust to the visual ambiguities introduced by a copy-paste operation (sketched after this list), our proposed spatial relationship induced network (SIRI) outperforms the state-of-the-art method LingUnet by 24% for A@80px, and it generalizes consistently well on our proposed extended dataset.
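The copy-paste robustness test mentioned in the last highlight can be pictured with a short, hedged sketch: duplicating an existing object elsewhere in the panorama creates a second, distracting instance of it. This is only an illustration of the idea; the crop selection and paste policy are assumptions, not the authors' exact procedure.

```python
# A minimal sketch of introducing a copy-paste visual ambiguity into a panorama.
# The crop box and paste location are hypothetical placeholders.
import numpy as np

def copy_paste_ambiguity(image, box, paste_xy):
    """image: (H, W, 3) uint8 array; box: (y0, x0, y1, x1); paste_xy: (y, x)."""
    y0, x0, y1, x1 = box
    patch = image[y0:y1, x0:x1].copy()    # duplicate an existing object region
    h, w = patch.shape[:2]
    py, px = paste_xy                     # assumed to keep the patch in bounds
    out = image.copy()
    out[py:py + h, px:px + w] = patch     # paste it elsewhere, creating a
    return out                            # second, visually ambiguous instance
```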
Methods
  • The authors illustrate the proposed SIRI network in Figure 2.
  • It consists of an object-level visual feature correlation, a local spatial relationship guided distillation and a global spatial positional embedding (a structural sketch of these three stages follows this list).
  • As shown in Table 5, the proposed SIRI cannot yet run in real time, but solutions such as model compression and model distillation can still be studied.
  • The authors leave this for future work.
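To make the ordering of the three stages concrete, here is a minimal structural sketch, not the authors' implementation: the module sizes, the 1x1-convolution stand-ins and the way the sentence embedding is fused are all assumptions; only the overall pipeline (correlation, then language-guided distillation, then a coordinate embedding before the prediction head) follows the description above.

```python
# A hedged structural sketch of the three SIRI stages; all layer choices are assumptions.
import torch
import torch.nn as nn

class SIRISketch(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        # (I) object-level visual feature correlation (1x1 conv stand-in)
        self.correlate = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)
        # (II) local spatial relationship guided distillation (language-fused stand-in)
        self.distill = nn.Sequential(
            nn.Conv2d(vis_dim + txt_dim, vis_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # (III) global spatial positional embedding: two extra coordinate channels
        self.head = nn.Conv2d(vis_dim + 2, 1, kernel_size=1)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W) panorama features; txt_feat: (B, C) sentence embedding
        b, _, h, w = vis_feat.shape
        x = self.correlate(vis_feat)
        txt = txt_feat[:, :, None, None].expand(-1, -1, h, w)
        x = self.distill(torch.cat([x, txt], dim=1))
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        x = torch.cat([x, ys, xs], dim=1)
        return self.head(x)  # (B, 1, H, W) location heatmap logits
```

For example, `SIRISketch()(torch.randn(1, 256, 64, 128), torch.randn(1, 256))` returns a (1, 1, 64, 128) tensor of heatmap logits.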
Results
  • Successful cases: this part shows successful examples of SIRI and LingUnet.
  • When both methods localize the target correctly, SIRI's prediction is closer to the ground truth.
  • Ambiguity in both images and descriptions: in this case, the ambiguities in both the images and the descriptions make it difficult to localize targets correctly.
  • Although SIRI and LingUnet predict correctly at the local level, the final results are wrong because of the ambiguity of the language descriptions.
Conclusion
  • The authors present a novel spatial relationship induced network for the SDR task.
  • It characterizes the object-level visual feature correlations, which enables an object to perceive the surrounding scene.
  • Global positional information is embedded to alleviate the ambiguities caused by the absence of spatial positions throughout the entire image.
Tables
  • Table 1: Comparison with different methods on Touchdown's validation set and testing set.
  • Table 2: Generalization results on our proposed extended Touchdown dataset.
  • Table 3: Ablation study for all procedures: (I) Object-Level Visual Feature Correlation; (II) Local Spatial Relationship Guided Distillation; (III) Global Spatial Positional Embedding. These procedures are sequentially appended to the LingUnet-only baseline.
  • Table 4: Performance drop caused by the introduced visual ambiguities for SIRI and LingUnet.
  • Table 5: Running time, number of parameters and A@80px for SIRI and LingUnet on the Touchdown dataset (a sketch of the A@80px metric follows this list).
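A@80px in the tables above reads as accuracy within 80 pixels of the annotated target. The sketch below shows one way such a metric could be computed, assuming the prediction is taken as the argmax of the output heatmap and compared to the ground-truth point by Euclidean distance; the exact evaluation protocol behind the reported numbers may differ.

```python
# A hedged sketch of an accuracy@threshold metric in the spirit of A@80px.
import numpy as np

def accuracy_at_px(pred_heatmaps, gt_points, threshold=80.0):
    """pred_heatmaps: (N, H, W) arrays; gt_points: (N, 2) array of (row, col)."""
    hits = 0
    for heatmap, (gy, gx) in zip(pred_heatmaps, gt_points):
        # take the heatmap peak as the predicted pixel location
        py, px = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        if np.hypot(py - gy, px - gx) <= threshold:
            hits += 1
    return hits / len(gt_points)
```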
Related work
  • Language-Guided Localization Tasks. Visual grounding [20, 18, 28, 6] and referring expression comprehension [16, 27, 25] aim to locate target objects or regions according to given language.

    The images in these tasks are perspective images that contain a limited number of entities, and the expression languages are also short. Object detection, one of the tasks on these datasets, is commonly used to provide a prior that serves as a correspondence between objects in images and entity nouns in the language. Methods under the object detection framework fall into two categories. The first category [19, 24, 27, 18] has two stages: object detection is carried out first, and object proposals are then ranked according to the language query. Two-stage approaches, however, are time-consuming, so one-stage approaches [26, 21, 29, 4] have been proposed for greater efficiency. Nevertheless, object detectors can fail in the real-world environments of spatial description resolution [3], where more objects and complex backgrounds are included within large fields of view, as shown in Figure 1. In addition, the language descriptions given in SDR are longer and describe more spatial relationships between object pairs. Existing one-stage methods, which rely on weak contextual information about objects for grounding, are therefore ill-suited to processing spatial positioning words. Recently, LingUnet [3] was proposed; it treats linguistic features as dynamic filters that convolve the visual features, taking all regions into consideration (a minimal sketch of this dynamic-filter idea follows this paragraph). However, it does not yet fully explore the visual and spatial relationships in such complex environments. In this paper, we intend to fully investigate these spatial relationships between objects.
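For readers unfamiliar with the dynamic-filter idea attributed to LingUnet above, the sketch below shows one common way to realize it: predict a convolution kernel from the sentence embedding and slide it over the visual features. The single-scale, 1x1-kernel setup and the layer sizes are assumptions for illustration, not LingUnet's actual multi-level architecture.

```python
# A hedged sketch of language features used as dynamic convolution filters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTextFilter(nn.Module):
    """Predicts a 1x1 conv filter from the sentence embedding and applies it."""
    def __init__(self, txt_dim=256, vis_dim=256):
        super().__init__()
        self.to_filter = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W); txt_feat: (B, D)
        b, c, h, w = vis_feat.shape
        filt = self.to_filter(txt_feat).view(b, c, 1, 1)  # one filter per sample
        vis = vis_feat.view(1, b * c, h, w)               # fold batch into groups
        out = F.conv2d(vis, filt, groups=b)               # (1, B, H, W)
        return out.view(b, 1, h, w)                       # text-conditioned response map
```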
Funding
  • This work was supported by the National Key R&D Program of China (2018AAA0100704), NSFC (No. 61932020, No. 61773272), the Science and Technology Commission of Shanghai Municipality (Grant No. 20ZR1436000) and the ShanghaiTech-Megavii Joint Lab.
Study subjects and analysis
Samples: 27,575
In Touchdown, location descriptions are given as natural language and the target locations are presented as heatmaps. In total, this dataset contains 27,575 samples for SDR, including 17,878 training samples, 3,836 validation samples and 3,859 testing samples. To see how well our proposed method generalizes in the wild, we built a new, extended dataset of Touchdown, using data collected under the same settings as the original dataset.

References
  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  • [2] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks. In IEEE International Conference on Computer Vision, pages 3286–3295, 2019.
  • [3] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.
  • [4] X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo. Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426, 2018.
  • [5] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and Y. Kalantidis. Graph-based global reasoning networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
  • [6] P. Dogan, L. Sigal, and M. Gross. Neural sequential phrase grounding (SeqGROUND). In IEEE Conference on Computer Vision and Pattern Recognition, pages 4175–4184, 2019.
  • [7] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3):300–316, 2008.
  • [8] J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai. Learning region features for object detection. In European Conference on Computer Vision (ECCV), pages 381–395, 2018.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [11] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [13] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.
  • [14] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pages 9605–9616, 2018.
  • [15] F. Manhardt, W. Kehl, and A. Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2069–2078, 2019.
  • [16] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision, pages 792–807, 2016.
  • [17] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  • [18] B. A. Plummer, P. Kordas, M. Hadi Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik. Conditional image-text embedding networks. In European Conference on Computer Vision (ECCV), pages 249–264, 2018.
  • [19] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
  • [20] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834, 2016.
  • [21] A. Sadhu, K. Chen, and R. Nevatia. Zero-shot grounding of objects from natural language queries. In IEEE International Conference on Computer Vision, pages 4694–4703, 2019.
  • [22] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
  • [23] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision, pages 618–626, 2017.
  • [24] L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):394–407, 2018.
  • [25] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1960–1968, 2019.
  • [26] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. A fast and accurate one-stage approach to visual grounding. In IEEE International Conference on Computer Vision, pages 4683–4693, 2019.
  • [27] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
  • [28] Z. Yu, J. Yu, C. Xiang, Z. Zhao, Q. Tian, and D. Tao. Rethinking diversified and discriminative proposal generation for visual grounding. In IJCAI, 2018.
  • [29] F. Zhao, J. Li, J. Zhao, and J. Feng. Weakly supervised phrase localization with multi-scale anchored transformer network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5696–5705, 2018.
Authors
Peiyao Wang
Yanyu Xu
Haojie Li
Jianyu Yang