Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
EMNLP 2020, pp. 4392–4412
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Further...
- Vision-and-Language Navigation (VLN) tasks require computational agents to mediate the relationship between language, visual scenes and movement.
- Datasets have been collected for both indoor (Anderson et al., 2018b; Thomason et al., 2019b; Qi et al., 2020) and outdoor (Chen et al., 2019; Mehta et al., 2020) environments; success in these is based on clearly defined, objective task completion rather than language- or vision-specific annotations.
- These VLN tasks fall in the Goldilocks zone: they can be tackled – but not solved – with current methods, and progress on them makes headway on real world grounded language understanding.
- The authors concatenate R2R and RxR annotations as a simple multitask strategy (Wang et al., 2020): the agent trained on both datasets obtains across-the-board improvements.
- We provide monolingual and multilingual baseline experiments using a variant of the Reinforced Cross-Modal Matching agent (Wang et al., 2019).
- Performance generally improves with monolingual training, and with using RxR’s follower paths in addition to its guide paths.
- In contrast to algorithmically generated Guide paths (G-paths), each Follower path (F-path) reflects a grounded human interpretation of an instruction, which may deviate from the G-path because multiple correct interpretations are possible (e.g., Figure 4)
- RxR represents a significant evolution in the scale, scope and possibilities for research on embodied language agents in simulated, photo-realistic 3D environments
- Every instruction is accompanied by a Follower demonstration, including a perspective camera pose trace that shows a play-by-play account of how a human interpreted the instructions given their position and progress through the path
- Unimodal ablations: Table 7 reports the performance of the multilingual agent under settings in which the authors ablate either the vision or the language inputs during both training and evaluation, as advocated by Thomason et al. (2019a).
- The language-only agent performs better than the vision-only agent.
- This is likely because even without vision, parts of the instructions such as 'turn left' and 'go upstairs' still have meaning in the context of the navigation graph.
- The vision-only model has no access to the instructions, without which its paths are essentially random.
- Table 5 provides results on the val-unseen split for several training settings, as well as human performance from Follower annotations.
- The authors report en-US and en-IN results together as en.
- Experiments 1–3 compare agents trained (1) only on G-paths, (2) only on F-paths, and (3) on both.
- The authors do not differentiate F-paths from G-paths during training.
- Results are reported per language (en, hi, te) as Navigation Error (NE ↓), Success Rate (SR ↑), SDTW ↑ and NDTW ↑, for monolingual agents (1–3), multilingual agents (4–6) and human Followers (H).
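The NE, SR, SDTW and NDTW columns follow the standard VLN metrics: navigation error, success rate, and the path-fidelity measures of Ilharco et al. (2019). A minimal sketch of NDTW and SDTW, assuming 2-D positions in metres and the usual 3 m success threshold:

```python
import math

def dtw(reference, query):
    """Plain dynamic time warping cost between two 2-D point sequences."""
    n, m = len(reference), len(query)
    # dp[i][j] = DTW cost of reference[:i] vs query[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            dp[i][j] = d + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]: exp(-DTW / (|reference| * threshold))."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))

def sdtw(reference, query, threshold=3.0):
    """Success-weighted DTW: nDTW gated on whether the agent stops within
    `threshold` metres of the goal (the last reference point)."""
    success = math.dist(query[-1], reference[-1]) <= threshold
    return ndtw(reference, query, threshold) if success else 0.0
```

An identical path scores an NDTW of 1.0; a path that wanders but still stops near the goal keeps a partial SDTW, while one that stops far away scores 0.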
- The authors have shown that Follower demonstrations can help with agent training, but they also open up new possibilities for studying grounded language pragmatics in the VLN setting, and for training VLN agents with perspective cameras – either in the graph-based simulator or by lifting RxR into a continuous simulator (Krantz et al., 2020).
- Table1: VLN dataset comparison. RxR is larger, multilingual, and includes dense spatiotemporal groundings (Ground) and follower demonstrations (Demos)
- Table2: RxR summary statistics. Times in seconds (s)
- Table3: Linguistic phenomena in a manually annotated random sample of 25 paths from RxR and R2R. p is the % of sentences that contain each phenomenon, and μ is the average number of times it occurs within a sentence
- Table4: Simple baselines on val-unseen paths. RxR proves more difficult than R2R overall, and less amenable to agents that tend to go straight (baselines 2 and 3). Note: Baseline 3 partly exploits the gold path
- Table5: RxR val-unseen: Monolingual vs. multilingual results. Training with both Guide and Follower paths benefits all languages (exp. 3 vs. 1 and 2), monolingual outperforms multilingual (exp. 3 vs. 4), training with cross-translations hurts performance (exp. 5 vs. 4), and training with visual attention supervision gives mixed results (Multi* in exp. 6 vs 4)
- Table6: Multitask and transfer learning results on RxR and R2R val-unseen. A multitask model (exp. 8) performs best on both datasets, but domain differences thwart simple transfer learning (i.e., train on X, evaluate on Y)
- Table7: Language-only and vision-only model ablations on RxR val-unseen. The language-only agent is much better than random, but both modalities are required for best performance
- Table8: RxR test set results, based on the monolingual agents (3) and the multilingual agent (4)
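Table 4 observes that RxR is less amenable than R2R to agents that tend to go straight (baselines 2 and 3). A minimal, hypothetical sketch of such a heuristic on a navigation graph; the greedy minimum-turn rule and 2-D node positions are illustrative assumptions, not the paper's exact baseline:

```python
import math

def go_straight(graph, positions, start, heading, max_steps=8):
    """Greedy 'keep going straight' baseline on a navigation graph.

    graph maps node -> neighbours; positions maps node -> (x, y).
    At each step the agent moves to the neighbour whose bearing deviates
    least from its current heading, then updates the heading.
    """
    path, node = [start], start
    for _ in range(max_steps):
        best, best_turn = None, math.inf
        for nbr in graph[node]:
            bearing = math.atan2(positions[nbr][1] - positions[node][1],
                                 positions[nbr][0] - positions[node][0])
            # Absolute turn in [0, pi] relative to the current heading.
            turn = abs(math.remainder(bearing - heading, 2 * math.pi))
            if turn < best_turn:
                best, best_turn = nbr, turn
        if best is None:
            break
        heading = math.atan2(positions[best][1] - positions[node][1],
                             positions[best][0] - positions[node][0])
        node = best
        path.append(node)
    return path
```

On R2R's shorter, more direct paths such a heuristic is a surprisingly strong baseline; RxR's longer, winding paths defeat it.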
Study subjects and analysis
Resulting Guide-Follower pairs per path: up to 3
If the second Follower also fails, the path is re-enqueued to generate another Guide and Follower annotation. The most successful of the three resulting Guide-Follower pairs is selected for inclusion in RxR and the others are discarded. Beyond validating data quality, the Follower task also trains annotators to be better Guides: following bad instructions often helps one see how to produce better instructions
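The quality-control loop above can be sketched as follows; write_instruction, follow and score are placeholders for the human Guide task, the human Follower task and a path-similarity check (e.g. NDTW), and the control flow is a simplification of the authors' actual pipeline:

```python
def collect_annotation(path, write_instruction, follow, score, success=1.0):
    """Return the best (score, instruction, follower_path) pair for `path`.

    Attempt order, per the pipeline described above:
      1. Guide 1's instruction, Follower 1
      2. the same instruction, Follower 2 (if the first failed)
      3. a fresh Guide and Follower (if the second also failed)
    """
    pairs = []

    def attempt(instruction):
        follower_path = follow(instruction)
        pairs.append((score(path, follower_path), instruction, follower_path))
        return pairs[-1][0] >= success

    first = write_instruction(path)
    if not attempt(first) and not attempt(first):
        # Path re-enqueued: another Guide and Follower annotation.
        attempt(write_instruction(path))
    # The most successful pair is kept; the rest are discarded.
    return max(pairs, key=lambda p: p[0])
```

Even when all three attempts fail, the highest-scoring pair is still returned, matching the "most successful ... is selected" rule.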
- Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In NAACL-HLT.
- Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. 2018a. On evaluation of embodied navigation agents. arXiv:1807.06757.
- Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018b. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR.
- Andrea Bender and Sieghard Beller. 2014. Mapping spatial frames of reference onto time: A review of theoretical accounts and empirical findings. Cognition, 132(3):342–382.
- Emily M. Bender. 2009. Linguistically naïve != language independent: Why NLP needs linguistic typology. In EACL Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?
- Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. 3DV.
- David L Chen and Raymond J Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI.
- Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR.
- Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. TACL.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. In NeurIPS.
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010.
- Daniel BM Haun, Christian J Rapold, Gabriele Janzen, and Stephen C Levinson. 2011. Plasticity of human spatial cognition: Spatial language and cognition covary across cultures. Cognition, 119(1):70–80.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation.
- Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, and Stephen Gould. 2020. Sub-instruction aware vision-and-language navigation. In EMNLP.
- Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, and Eugene Ie. 2019. Transferable representation learning in vision-and-language navigation. In ICCV.
- Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. 2019. Effective and general evaluation for instruction conditioned navigation using dynamic time warping. NeurIPS Visually Grounded Interaction and Language Workshop (ViGIL).
- Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. 2019. Stay on the path: Instruction fidelity in vision-and-language navigation. In ACL.
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
- Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the nav-graph: Vision and language navigation in continuous environments. In ECCV.
- Larry Lansing, Vihan Jain, Harsh Mehta, Haoshuo Huang, and Eugene Ie. 2019. VALAN: Vision and language agent navigation. arXiv:1912.03241.
- Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. 2019. Robust navigation with language pretraining and stochastic sampling. In EMNLP.
- Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
- Harsh Mehta, Yoav Artzi, Jason Baldridge, Eugene Ie, and Piotr Mirowski. 2020. Retouchdown: Adding touchdown to streetlearn as a shareable resource for language grounding tasks in street view. EMNLP Workshop on Spatial Language Understanding (SpLU).
- Dipendra Kumar Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3d environments with visual goal prediction. In EMNLP.
- Edward Munnich, Barbara Landau, and Barbara Anne Dosher. 2001. Spatial language and spatial representation: A cross-linguistic comparison. Cognition, 81(3):171–208.
- Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. 2020. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. arXiv:2004.15020.
- Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. 2020. Connecting vision and language with localized narratives. In ECCV.
- Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. 2020. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. arXiv:2007.03001.
- Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020. REVERIE: Remote embodied visual referring expression in real indoor environments. In CVPR.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. Imagenet large scale visual recognition challenge. IJCV.
- Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. 2019. Taking a hint: Leveraging explanations to make vision and language models more grounded. In ICCV.
- Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
- Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR.
- Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL.
- Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML.
- Jesse Thomason, Daniel Gordon, and Yonatan Bisk. 2019a. Shifting the baseline: Single modality performance on visual navigation & QA. In NAACL.
- Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019b. Vision-and-dialog navigation. In CoRL.
- Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR.
- Xin Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, and Sujith Ravi. 2020. Environment-agnostic multitask learning for natural language grounded navigation. In ECCV.
- Jialin Wu and Raymond Mooney. 2019. Self-critical reasoning for robust visual question answering. In NeurIPS.
- Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. 2020. BabyWalk: Going farther in vision-and-language navigation by taking baby steps. In ACL.