Room Across Room: Multilingual Vision and Language Navigation with Dense Spatiotemporal Grounding

EMNLP 2020, pp. 4392–4412.


Abstract

We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Further...

Introduction
  • Vision-and-Language Navigation (VLN) tasks require computational agents to mediate the relationship between language, visual scenes and movement.
  • Datasets have been collected for both indoor (Anderson et al., 2018b; Thomason et al., 2019b; Qi et al., 2020) and outdoor (Chen et al., 2019; Mehta et al., 2020) environments; success in these is based on clearly defined, objective task completion rather than language- or vision-specific annotations.
  • These VLN tasks fall in the Goldilocks zone: they can be tackled – but not solved – with current methods, and progress on them makes headway on real world grounded language understanding.
  • The authors concatenate R2R and RxR annotations as a simple multitask strategy (Wang et al., 2020): the agent trained on both datasets obtains across-the-board improvements (a sketch of this data pooling follows the list).
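
The "simple multitask strategy" mentioned above amounts to pooling the R2R and RxR annotations and sampling training batches from the combined pool. The sketch below is an illustrative assumption of what such pooling could look like (function and argument names are made up), not the authors' implementation.

```python
import random

def multitask_batches(rxr_examples, r2r_examples, batch_size=32, seed=0):
    """Pool RxR and R2R (instruction, path) annotations and yield mixed batches.

    A minimal sketch of the concatenation-style multitask setup; the real
    training pipeline handles batching and sampling differently.
    """
    pool = list(rxr_examples) + list(r2r_examples)  # no dataset labels kept
    random.Random(seed).shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]
```
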
Highlights
  • Vision-and-Language Navigation (VLN) tasks require computational agents to mediate the relationship between language, visual scenes and movement
  • We provide monolingual and multilingual baseline experiments using a variant of the Reinforced Cross-Modal Matching agent (Wang et al, 2019)
  • Performance generally improves by using monolingual learning, and by using RxR’s follower paths as well as its guide paths
  • In contrast to algorithmically generated G-paths, each Follower path (F-path) reflects a grounded human interpretation of an instruction, which may deviate from the Guide path (G-path) because multiple correct interpretations are possible (e.g., Figure 4)
  • RxR represents a significant evolution in the scale, scope and possibilities for research on embodied language agents in simulated, photo-realistic 3D environments
  • Every instruction is accompanied by a Follower demonstration, including a perspective camera pose trace that shows a play-by-play account of how a human interpreted the instructions given their position and progress through the path
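
To make the pose-trace annotation concrete, the following is a hypothetical sketch of how one time-aligned annotation record might be represented; the class and field names are illustrative assumptions, not the released RxR schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PoseSample:
    """One time-stamped perspective-camera pose from a Guide or Follower trace."""
    time: float        # seconds since the annotation session started
    pano_id: str       # panorama (navigation-graph node) the annotator stood at
    heading: float     # camera heading in radians
    pitch: float       # camera pitch in radians

@dataclass
class TimedWord:
    """One instruction word aligned to the interval in which it was spoken."""
    word: str
    start_time: float
    end_time: float

@dataclass
class AnnotationRecord:
    """A single RxR-style instruction with its path and dense groundings."""
    instruction: str
    language: str               # e.g. "en-US", "en-IN", "hi-IN", or "te-IN"
    path: List[str]             # sequence of panorama ids along the path
    pose_trace: List[PoseSample]
    timed_words: List[TimedWord]
```
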
Methods
  • Unimodal ablations: Table 7 reports the performance of the multilingual agent under settings in which the authors ablate either the vision or the language inputs during both training and evaluation, as advocated by Thomason et al. (2019a); a sketch of this setup follows the list.
  • The language-only agent performs better than the vision-only agent.
  • This is likely because even without vision, parts of the instructions such as ‘turn left’ and ‘go upstairs’ still have meaning in the context of the navigation graph.
  • The vision-only model has no access to the instructions, without which the paths are highly random
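
One common way to run the unimodal ablations described above is to replace one modality's inputs with neutral values at both training and evaluation time. The helper below is a hypothetical sketch for illustration (names and shapes are assumptions), not the authors' code.

```python
import numpy as np

def ablate_inputs(vision_features, instruction_tokens, mode="full"):
    """Zero out one modality so the agent can rely only on the other.

    mode is one of "full", "language_only", or "vision_only"; a sketch of
    the single-modality baselines advocated by Thomason et al. (2019a).
    """
    if mode == "language_only":
        # The agent still sees the navigation graph and the instruction,
        # but every image feature is replaced with zeros.
        vision_features = np.zeros_like(vision_features)
    elif mode == "vision_only":
        # The agent sees images but the instruction is reduced to padding.
        instruction_tokens = [0] * len(instruction_tokens)
    return vision_features, instruction_tokens
```
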
Results
  • Table 5 provides results on the val-unseen split for several training settings, as well as human performance from Follower annotations.
  • The authors report en-US and en-IN results together as en.
  • Experiments 1–3 compare agents trained (1) only on G-paths, (2) only on F-paths, and (3) on both.
  • The authors do not differentiate F-paths from G-paths during training. For each setting, Table 5 lists the training paths used (Guide, Follower, and cross-translated instructions), the number of instruction-path pairs in thousands, and NE ↓, SR ↑, SDTW ↑ and NDTW ↑ per language (en, hi, te) across the monolingual (1–3), multilingual (4–6, including Multi* with attention supervision) and human (H) rows; a sketch of the nDTW metric follows this list.
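
NE (navigation error), SR (success rate), nDTW (normalized dynamic time warping; Ilharco et al., 2019) and SDTW (success-weighted DTW) are the path metrics used in Tables 5–8. As a reference point, here is a minimal Python sketch of nDTW, assuming paths are given as 3D node coordinates and using the customary 3 m success threshold as the normalization distance; the official evaluation code may differ in detail.

```python
import numpy as np

def ndtw(reference, prediction, success_threshold=3.0):
    """Normalized Dynamic Time Warping between two paths of (x, y, z) points."""
    reference = np.asarray(reference, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    r, p = len(reference), len(prediction)

    # dtw[i, j]: minimal accumulated cost of aligning reference[:i] with prediction[:j].
    dtw = np.full((r + 1, p + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, r + 1):
        for j in range(1, p + 1):
            cost = np.linalg.norm(reference[i - 1] - prediction[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])

    # Normalize by reference length and threshold, then squash into (0, 1].
    return float(np.exp(-dtw[r, p] / (r * success_threshold)))

# A prediction that hugs the reference path scores close to 1.
ref = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
pred = [(0, 0.1, 0), (1, 0.1, 0), (2, 0.1, 0)]
print(round(ndtw(ref, pred), 3))  # ~0.967
```
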
Conclusion
  • RxR represents a significant evolution in the scale, scope and possibilities for research on embodied language agents in simulated, photo-realistic 3D environments.
  • The authors have shown that these annotations can help with agent training, but they also open up new possibilities for studying grounded language pragmatics in the VLN setting, and for training VLN agents with perspective cameras – either in the graph-based simulator or by lifting RxR into a continuous simulator (Krantz et al., 2020)
Tables
  • Table1: VLN dataset comparison. RxR is larger, multilingual, and includes dense spatiotemporal groundings (Ground) and follower demonstrations (Demos)
  • Table2: RxR summary statistics. Times in seconds (s)
  • Table3: Linguistic phenomena in a manually annotated random sample of 25 paths from RxR and R2R. p is the % of sentences that contain the phenomena while μ is the average number of times they occur within each sentence
  • Table4: Simple baselines on val-unseen paths. RxR proves more difficult than R2R overall, and less amenable to agents that tend to go straight (baselines 2 and 3). Note: Baseline 3 partly exploits the gold path
  • Table5: RxR val-unseen: Monolingual vs. multilingual results. Training with both Guide and Follower paths benefits all languages (exp. 3 vs. 1 and 2), monolingual outperforms multilingual (exp. 3 vs. 4), training with cross-translations hurts performance (exp. 5 vs. 4), and training with visual attention supervision gives mixed results (Multi* in exp. 6 vs 4)
  • Table6: Multitask and transfer learning results on RxR and R2R val-unseen. A multitask model (exp. 8) performs best on both datasets, but domain differences thwart simple transfer learning (i.e., train on X, evaluate on Y)
  • Table7: Language-only and vision-only model ablations on RxR val-unseen. The language-only agent is much better than random, but both modalities are required for best performance
  • Table8: RxR test set results, based on the monolingual agents (3) and the multilingual agent (4)
Study subjects and analysis
Resulting Guide-Follower pairs: 3
If the second Follower also fails, then the path is reenqueued to generate another Guide and Follower annotation. The most successful of the three resulting Guide-Follower pairs is selected for inclusion in RxR and the others are discarded. In addition to validating data quality, the Follower task also trains annotators to be better Guides—following bad instructions often helps one see how to produce better instructions
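
A minimal sketch of this quality-control loop is shown below; annotate_guide, annotate_follower, and similarity are hypothetical stand-ins for the two crowd tasks and the path-matching score used to judge whether a Follower attempt succeeded, and the threshold and control flow are assumptions rather than the exact production pipeline.

```python
def collect_annotation(path, annotate_guide, annotate_follower, similarity,
                       success_threshold=0.8):
    """Collect up to three Guide-Follower pairs for a path and keep the best one."""
    pairs = []
    guide = annotate_guide(path)                   # a Guide writes an instruction
    for attempt in range(3):
        follower_path = annotate_follower(guide)   # a Follower tries to execute it
        score = similarity(path, follower_path)
        pairs.append((score, guide, follower_path))
        if score >= success_threshold:
            break                                  # a successful pair: stop early
        if attempt == 1:
            # Two Followers failed on this instruction: re-enqueue the path
            # so a fresh Guide produces a new instruction.
            guide = annotate_guide(path)
    # The most successful Guide-Follower pair is kept; the others are discarded.
    _, best_guide, best_follower = max(pairs, key=lambda item: item[0])
    return best_guide, best_follower
```
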

References
  • Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In NAACL-HLT.
  • Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. 2018a. On evaluation of embodied navigation agents. arXiv:1807.06757.
  • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018b. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR.
  • Andrea Bender and Sieghard Beller. 2014. Mapping spatial frames of reference onto time: A review of theoretical accounts and empirical findings. Cognition, 132(3):342–382.
  • Emily M. Bender. 2009. Linguistically naïve != language independent: Why NLP needs linguistic typology. In EACL Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?
  • Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV.
  • David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI.
  • Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR.
  • Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. TACL.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. In NeurIPS.
  • Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010.
  • Daniel B. M. Haun, Christian J. Rapold, Gabriele Janzen, and Stephen C. Levinson. 2011. Plasticity of human spatial cognition: Spatial language and cognition covary across cultures. Cognition, 119(1):70–80.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation.
  • Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, and Stephen Gould. 2020. Sub-instruction aware vision-and-language navigation. In EMNLP.
  • Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, and Eugene Ie. 2019. Transferable representation learning in vision-and-language navigation. In ICCV.
  • Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. 2019. Effective and general evaluation for instruction conditioned navigation using dynamic time warping. NeurIPS Visually Grounded Interaction and Language Workshop (ViGIL).
  • Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. 2019. Stay on the path: Instruction fidelity in vision-and-language navigation. In ACL.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
  • Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the nav-graph: Vision and language navigation in continuous environments. In ECCV.
  • Larry Lansing, Vihan Jain, Harsh Mehta, Haoshuo Huang, and Eugene Ie. 2019. VALAN: Vision and language agent navigation. arXiv:1912.03241.
  • Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. 2019. Robust navigation with language pretraining and stochastic sampling. In EMNLP.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  • Harsh Mehta, Yoav Artzi, Jason Baldridge, Eugene Ie, and Piotr Mirowski. 2020. Retouchdown: Adding touchdown to streetlearn as a shareable resource for language grounding tasks in street view. EMNLP Workshop on Spatial Language Understanding (SpLU).
  • Dipendra Kumar Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3D environments with visual goal prediction. In EMNLP.
  • Edward Munnich, Barbara Landau, and Barbara Anne Dosher. 2001. Spatial language and spatial representation: A cross-linguistic comparison. Cognition, 81(3):171–208.
  • Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. 2020. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. arXiv:2004.15020.
  • Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. 2020. Connecting vision and language with localized narratives. In ECCV.
  • Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. 2020. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. arXiv:2007.03001.
  • Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020. REVERIE: Remote embodied visual referring expression in real indoor environments. In CVPR.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. IJCV.
  • Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. 2019. Taking a hint: Leveraging explanations to make vision and language models more grounded. In ICCV.
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
  • Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR.
  • Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL.
  • Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML.
  • Jesse Thomason, Daniel Gordon, and Yonatan Bisk. 2019a. Shifting the baseline: Single modality performance on visual navigation & QA. In NAACL.
  • Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019b. Vision-and-dialog navigation. In CoRL.
  • Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR.
  • Xin Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, and Sujith Ravi. 2020. Environment-agnostic multitask learning for natural language grounded navigation. In ECCV.
  • Jialin Wu and Raymond Mooney. 2019. Self-critical reasoning for robust visual question answering. In NeurIPS.
  • Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. 2020. BabyWalk: Going farther in vision-and-language navigation by taking baby steps. In ACL.
Authors
Alexander Ku
Roma Patel