Visual Grounding in Video for Unsupervised Word Translation

CVPR, pp. 10847-10856, 2020.


Abstract:

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between the two languages …

Introduction
  • Children can learn multiple languages by merely observing their environment and interacting with others, without any explicit supervision or instruction; multilingual children do not hear a sentence and its translation simultaneously, and they do not hear a sentence in multiple languages while observing the same situation [20].
Highlights
  • Children can learn multiple languages by merely observing their environment and interacting with others, without any explicit supervision or instruction; multilingual children do not hear a sentence and its translation simultaneously, and they do not hear a sentence in multiple languages while observing the same situation [20]
  • The contributions are threefold: (i) we propose a method to map languages through the visual domain using only unpaired instructional videos; (ii) we demonstrate that our method is effective at connecting words in different languages through vision in an unsupervised manner; and (iii) we show that our method can serve as a good initialization for existing word mapping techniques, addressing many shortcomings of text-based methods
  • We compare the following methods: 1) Iterative Procrustes, 2) MUSE [10], 3) VecMap [4], 4) MUVE, and 5) Supervised (a sketch of a single Procrustes step follows this list). Text-based methods are more suited for similar languages (e.g., English and French) [4, 43], and our results show that grounding word translation in the visual domain is especially effective in that regime
  • We observe in Table 1 a significant improvement of MUVE over our Base model alone (+19.8% and +30.3% absolute improvement on the Dictionary and Simple Words benchmarks, respectively). This experiment validates our intuition that the information contained in the visual domain is complementary to the word co-occurrence statistics used by text-based methods for the task of unsupervised word translation
  • Learning multiple languages is a challenging problem that multilingual children tackle with ease
  • We propose an unsupervised multimodal model for word translation that learns from instructional YouTube videos
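The Iterative Procrustes baseline listed above alternates between inducing a bilingual dictionary and re-fitting an orthogonal map between the two monolingual embedding spaces. As a rough illustration (not the authors' implementation), a single Procrustes step can be written as follows; the embedding matrices below are random stand-ins for real word vectors.

```python
import numpy as np

def procrustes_step(X, Y):
    """Closed-form orthogonal map W minimizing ||X W - Y||_F.

    X: (n, d) source-language vectors for n dictionary pairs.
    Y: (n, d) target-language vectors for the same pairs.
    """
    # Solution of the orthogonal Procrustes problem via SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical seed dictionary: random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # source-language word vectors
Y = rng.normal(size=(1000, 300))   # target-language word vectors
W = procrustes_step(X, Y)
mapped = X @ W                     # source vectors mapped into the target space
# An iterative variant would now re-induce pairs from mutual nearest
# neighbours of `mapped` and `Y`, then repeat the step until convergence.
```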
Methods
  • The authors first provide the implementation details (Sec. 5.1); in Sec. 5.2, the authors demonstrate the effectiveness of the Base Model in word translation benchmarks.
  • In Sec. 5.3, the authors show that the representations learned by the model can be used to improve the quality of text-based word translation methods.
  • The authors train monolingual word embeddings using Word2Vec [37] (Skip-Gram, 300 dimensions, a context window of 5 words, and 5 negative samples; see the gensim sketch after this list).
  • The authors use these pretrained embeddings in the MUVE, MUSE, and VecMap models.
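For concreteness, the monolingual embedding configuration above (Skip-Gram, 300 dimensions, a 5-word window, 5 negative samples) corresponds roughly to the following gensim call; the toy corpus is a hypothetical placeholder rather than the HowToW-Text narrations.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus for one language; a separate monolingual
# model would be trained per language on its own narration text.
sentences = [
    ["add", "the", "chopped", "onions", "to", "the", "pan"],
    ["whisk", "the", "eggs", "with", "a", "pinch", "of", "salt"],
]

model = Word2Vec(
    sentences,
    sg=1,             # Skip-Gram (rather than CBOW)
    vector_size=300,  # 300-dimensional embeddings
    window=5,         # context window of 5 words
    negative=5,       # 5 negative samples per positive pair
    min_count=1,      # keep every word in this toy corpus
)

vector = model.wv["onions"]   # 300-dimensional vector for one word
```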
Results
  • The authors report the results of the models and the baselines on the Dictionary and Simple Words benchmarks in Table 1.
  • The authors observe in Table 1 a significant improvement of MUVE over the Base model alone (+19.8% and +30.3% absolute improvement on the Dictionary and Simple Words benchmarks, respectively; a sketch of the Recall@k metric used in these tables follows this list).
  • Overall, this experiment validates the intuition that the information contained in the visual domain is complementary to the word co-occurrence statistics used by text-based methods for the task of unsupervised word translation.
  • The Dictionary evaluation sets are the “ground-truth bilingual dictionaries” en-fr.5000-6500.txt, en-ko.5000-6500.txt, and en-ja.5000-6500.txt, available at github.com/facebookresearch/MUSE.
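Recall@1 in these benchmarks is the fraction of query words whose ground-truth translation is the single nearest target-language word under the chosen similarity. Below is a minimal sketch of Recall@k with cosine similarity; the array names are hypothetical, and methods such as MUSE typically retrieve with CSLS rather than raw cosine.

```python
import numpy as np

def recall_at_k(src, tgt, gold, k=1):
    """Fraction of queries whose gold translation is among the k nearest
    target words by cosine similarity.

    src:  (q, d) query-word vectors already mapped into the target space.
    tgt:  (v, d) target-language vocabulary vectors.
    gold: (q,) index into tgt of the correct translation for each query.
    """
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src_n @ tgt_n.T                      # (q, v) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of k nearest targets
    hits = (topk == gold[:, None]).any(axis=1)
    return hits.mean()
```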
Conclusion
  • Learning multiple languages is a challenging problem that multilingual children tackle with ease.
  • The shared visual domain can help as it allows children to relate words in different languages through the similarity of their visual experience.
  • The authors propose an unsupervised multimodal model for word translation that learns from instructional YouTube videos (a generic sketch of this family of models follows this list).
  • This is beneficial over purely text-based methods, allowing for more robust translation when the training corpora are dissimilar.
  • Future work should explore extensions of the proposed model to translating full sentences.
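The Base Model belongs to the broad family of contrastive video-text embeddings trained on narrated clips. The sketch below is a generic, minimal illustration of that family, not the authors' architecture: the feature dimensions, linear projections, and InfoNCE-style loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextEmbedder(nn.Module):
    """Toy shared embedding: video features and narration features are
    projected into a common space and trained with a contrastive objective."""

    def __init__(self, video_dim=1024, text_dim=300, shared_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    """NCE-style loss: matching clip/narration pairs sit on the diagonal and
    are pulled together; other pairs in the batch act as negatives."""
    logits = v @ t.T / temperature            # (batch, batch) similarities
    targets = torch.arange(v.size(0))
    return F.cross_entropy(logits, targets)

# Hypothetical batch: 8 clips with 1024-d video and 300-d text features.
v_feats = torch.randn(8, 1024)
t_feats = torch.randn(8, 300)
model = VideoTextEmbedder()
loss = contrastive_loss(*model(v_feats, t_feats))
```

Roughly speaking, tying a text branch for each language to the same visual space is what allows words in different languages to be related through video.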
Tables
  • Table 1: Performance of our models and the baselines, reported as Recall@1 on the En-Fr Dictionary and Simple Words benchmarks
  • Table 2: Performance of our method and text-based methods across different language pairs. We report Recall@1 on the Dictionary dataset. All methods use word embeddings trained on HowToW-Text for their respective languages
  • Table 3: Robustness of different methods to the dissimilarity of training corpora. We report Recall@10 on the English-French Dictionary dataset for MUSE [10], VecMap [4], and MUVE, as well as the dissimilarity (∼) of the training corpora expressed as the Jensen-Shannon distance (a sketch of this measure follows the table list)
  • Table 4: Top-2 retrieved results in French on the Human Queries dataset given an English query
  • Table 5: Performance of our models and the baselines, measured as Recall@10 on the Dictionary and Simple Words benchmarks
  • Table 6: Performance of our method and text-based methods across different language pairs. We report Recall@10 on the Dictionary dataset
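Table 3 quantifies how dissimilar the two training corpora are using the Jensen-Shannon distance between their word-frequency distributions. A minimal sketch of that quantity is below; the toy token lists are hypothetical, and the paper's exact preprocessing and vocabulary choices may differ.

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def corpus_js_distance(tokens_a, tokens_b):
    """Jensen-Shannon distance between the unigram distributions of two corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    p = np.array([counts_a[w] for w in vocab], dtype=float)
    q = np.array([counts_b[w] for w in vocab], dtype=float)
    p /= p.sum()
    q /= q.sum()
    return jensenshannon(p, q, base=2)   # 0 = identical corpora, 1 = disjoint

# Hypothetical toy corpora standing in for two training texts.
print(corpus_js_distance("add the onions to the pan".split(),
                         "preheat the oven to two hundred degrees".split()))
```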
Funding
  • Demonstrates that we can map words between the languages, in particular the ‘visual’ words; that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and that our approach achieves superior performance by addressing the shortcomings of text-based methods: it is more robust, handles datasets with less commonality, and is applicable to low-resource languages
  • Demonstrates that, despite these challenges, a shared visual representation can facilitate the mapping of different languages at the word level
  • Proposes a model that maps two languages through the visual domain
  • Proposes a method to map languages through the visual domain using only unpaired instructional videos, demonstrates that our method is effective at connecting words in different languages through vision in an unsupervised manner, and shows that our method can serve as a good initialization for existing word mapping techniques, addressing many shortcomings of text-based methods
  • Explores whether sharing the conceptual representation improves the quality of word translation for different languages
  • When the training corpora in French and English are dissimilar, our method achieves 32.6% recall while text-based methods achieve less than 0.5%
Reference
  • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
  • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Learning bilingual word embeddings with (almost) no bilingual data. In ACL, 2017.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In AAAI, 2018.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In ACL, 2018.
  • Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan. Matching words and pictures. JMLR, 2003.
  • Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. Findings of the third shared task on multimodal machine translation. 2018.
  • João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. arXiv:1710.04087, 2017.
  • Annick De Houwer. Bilingual language acquisition. The Handbook of Child Language, 2017.
  • Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR, 2017.
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002.
  • Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kardas, Sylvain Gugger, and Jeremy Howard. MultiFiT: Efficient multi-lingual language model fine-tuning. In EMNLP, 2019.
  • Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. Multi30K: Multilingual English-German image descriptions. In ACL, 2016.
  • Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In NeurIPS, 2013.
  • Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In NeurIPS, 2015.
  • Fred Genesee. Early bilingual development: One language or two? Journal of Child Language, 1989.
  • Fred Genesee, Johanne Paradis, and Martha B. Crago. Dual Language Development & Disorders: A Handbook on Bilingualism & Second Language Learning. Paul H. Brookes Publishing, 2004.
  • Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • Thanh-Le Ha, Jan Niehues, and Alexander Waibel. Toward multilingual neural machine translation with universal encoder and decoder. In IWSLT, 2016.
  • Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 1990.
  • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017.
  • Karl Moritz Hermann, Mateusz Malinowski, Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, and Raia Hadsell. Learning to follow directions in Street View. In AAAI, 2020.
  • Yedid Hoshen and Lior Wolf. Non-adversarial unsupervised word translation. In EMNLP, 2018.
  • Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In ICCV, 2019.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. In ACL, 2017.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv:1602.02410, 2016.
  • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
  • Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In ACL, 2015.
  • Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. Learning visual question answering by bootstrapping hard attention. In ECCV, 2018.
  • Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A deep learning approach to visual question answering. IJCV, 2017.
  • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. arXiv:1912.06430, 2019.
  • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
  • Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv:1309.4168, 2013.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.
  • Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, and Bernt Schiele. Generating descriptions with grounded and co-referenced people. In CVPR, 2017.
  • Candace Ross, Andrei Barbu, Yevgeni Berzak, Battushig Myanganbayar, and Boris Katz. Grounding language acquisition by training semantic parsers using captioned videos. In EMNLP, 2018.
  • Ozan Sener, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. Unsupervised semantic parsing of video collections. In ICCV, 2015.
  • Kevin Shen, Amlan Kar, and Sanja Fidler. Lifelong learning for image captioning by asking natural language questions. In ICCV, 2019.
  • Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR, 2017.
  • Anders Søgaard, Sebastian Ruder, and Ivan Vulic. On the limitations of unsupervised bilingual dictionary induction. In ACL, 2018.
  • Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay Kuo, and Fei Huang. Unsupervised multi-modal neural machine translation. In CVPR, 2019.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
  • Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In ACL, 2015.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, 2015.
  • Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In ACM Multimedia, 2014.
  • Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.