Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Chao Zhang, Zichao Yang, Xiaodong He, Li Deng

IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 478-493, 2020.

DOI: https://doi.org/10.1109/JSTSP.2020.2987728
This paper reviews deep-learning-based modeling and machine learning across multiple modalities, focusing on the combination of vision and natural language.

Abstract:

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities.

Introduction
  • Significant progress has been made in the field of machine learning in recent years due to the rapid development of deep learning [1]–[6].
Highlights
  • This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles — learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications
  • We review the key concept of embedding, which unifies multimodal signals into the same vector space and enables cross-modality signal processing (a minimal code sketch follows this list)
  • A series of major milestones in pattern recognition with a single input modality has been achieved, dating back to the dramatic increase in the accuracy of large-scale automatic speech recognition (ASR) using fully connected deep neural networks (DNNs) and deep auto-encoders around 2010 [7]–[17], followed by breakthroughs in computer vision (CV) using deep convolutional neural network (CNN) models [18] for large-scale image classification around 2012 [19]–[22] and for large-scale object detection around 2014 [23]–[25]
  • This paper reviews deep-learning-based modeling and machine learning across multiple modalities, focusing on the combination of vision and natural language
  • In the section on representations, both single-modal and multimodal representations are reviewed under the key concept of embedding
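
Below is a minimal, illustrative sketch (not code from the paper) of the shared-embedding idea mentioned above: two modality-specific encoders project image and text features into one joint vector space, where cross-modal similarity reduces to a dot product. All module names and dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedding(nn.Module):
    """Project image and text features into a common vector space."""
    def __init__(self, img_dim=2048, txt_dim=300, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # e.g. pooled CNN features -> joint space
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # e.g. averaged word vectors -> joint space

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that dot products become cosine similarities
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

model = SharedEmbedding()
img = torch.randn(8, 2048)  # a batch of image feature vectors
txt = torch.randn(8, 300)   # a batch of sentence feature vectors
v, t = model(img, txt)
sim = v @ t.T               # 8x8 cross-modal similarity matrix
```

In practice, such encoders are trained with a ranking or contrastive loss over matched image-text pairs, so that matching pairs score higher than mismatched ones.
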
Conclusion
  • This paper reviews deep-learning-based modeling and machine learning across multiple modalities, focusing on the combination of vision and natural language.
  • The authors organize the large body of work in the language-vision multimodal intelligence field around three aspects: multimodal representations, the fusion of multimodal signals, and the applications of multimodal intelligence.
  • Three selected application areas of broad interest are presented: image caption generation, text-to-image synthesis, and visual question answering (a minimal fusion sketch follows this list).
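
As a companion illustration for the fusion and visual question answering themes above, here is a minimal sketch (not the paper's method, but in the spirit of the question-guided attention models surveyed): a question vector scores image-region features, and the attended visual summary is fused with the question for answer prediction. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Question-guided attention over image regions, followed by late fusion."""
    def __init__(self, region_dim=2048, q_dim=512, hidden=512, n_answers=1000):
        super().__init__()
        self.v_proj = nn.Linear(region_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden + q_dim, n_answers)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim); question: (batch, q_dim)
        v = self.v_proj(regions)
        joint = torch.tanh(v + self.q_proj(question).unsqueeze(1))
        attn = torch.softmax(self.score(joint).squeeze(-1), dim=-1)  # (batch, num_regions)
        attended = (attn.unsqueeze(-1) * v).sum(dim=1)               # attention-weighted visual summary
        return self.classifier(torch.cat([attended, question], dim=-1))

model = AttentionFusion()
regions = torch.randn(4, 36, 2048)  # e.g. 36 detected-region features per image
question = torch.randn(4, 512)      # encoded question vectors
logits = model(regions, question)   # (4, 1000) answer scores
```
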
References
  • G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006.
  • Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, pp. 1–127, 2009.
  • L. Deng and D. Yu, “Deep Learning: Methods and Applications,” Foundations and Trends in Signal Processing, vol. 7, pp. 197–387, 2014.
  • J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
  • Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
  • I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
  • D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” in Proc. NIPS Workshop, 2010.
  • L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep autoencoder,” in Proc. Interspeech, 2010.
  • L. Deng, “An overview of deep-structured learning for information processing,” in Proc. APSIPA ASC, 2011.
  • D. Yu, L. Deng, F. Seide, and G. Li, “Discriminative pre-training of deep neural networks,” U.S. Patent No. 9,235,799, 2011.
  • G. Dahl, D. Yu, and L. Deng, “Large-vocabulary continuous speech recognition with context-dependent DBN-HMMs,” in Proc. ICASSP, 2011.
  • L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero, “Recent advances in deep learning for speech research at Microsoft,” in Proc. ICASSP, 2013.
  • G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 30–42, 2012.
  • F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, pp. 82–97, 2012.
  • L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Proc. ICASSP, 2013.
  • D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. Springer, 2015.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
  • A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. CVPR, 2014.
  • R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015.
  • S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. NIPS, 2015.
  • G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, pp. 530–539, 2015.
  • D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
  • I. Sutskever, O. Vinyals, and Q. Le, “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014.
  • Y. Wu, M. Schuster, Z. Chen, Q. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv:1609.08144, 2016.
  • M.-T. Luong, H. Pham, and C. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. NAACL-HLT, 2018.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
  • J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019.
  • H.-Y. Shum, X. He, and D. Li, “From Eliza to XiaoIce: Challenges and opportunities with social chatbots,” Frontiers of Information Technology & Electronic Engineering, vol. 19, pp. 10–19, 2018.
  • S. Bengio, L. Deng, L. Morency, and B. Schuller, “Perspectives on predictive power of multimodal deep learning: Surprises and future directions,” in The Handbook of Multimodal-Multisensor Interfaces, ch. 14, ACM and Morgan & Claypool Publishers, 2019.
  • L. Deng and Y. Liu, Deep Learning in Natural Language Processing. Springer, 2018.
  • S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “ReferItGame: Referring to objects in photographs of natural scenes,” in Proc. EMNLP, 2014.
  • L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg, “Modeling context in referring expressions,” in Proc. ECCV, 2016.
  • B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proc. ICCV, 2015.
  • A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. CVPR, 2015.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR, 2015.
  • J. Johnson, A. Karpathy, and F.-F. Li, “DenseCap: Fully convolutional localization networks for dense captioning,” in Proc. CVPR, 2016.
  • D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual Turing test for computer vision systems,” in Proc. NAS, 2015.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proc. ICCV, 2015.
  • L. Yu, E. Park, A. Berg, and T. Berg, “Visual Madlibs: Fill in the blank description generation and question answering,” in Proc. ICCV, 2015.
  • X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image: Conditional image generation from visual attributes,” in Proc. ECCV, 2016.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. ICML, 2016.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in Proc. CVPR, 2018.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. CVPR, 2018.
  • S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Transactions on Multimedia, vol. 2, pp. 141–151, 2000.
  • M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” Journal of the Acoustical Society of America, vol. 120, pp. 2421–2424, 2006.
  • T. Afouras, J. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, pp. 1–13, 2018.
  • B. Maison, C. Neti, and A. Senior, “Audio-visual speaker recognition for video broadcast news: Some fusion techniques,” in Proc. MMSP, 1999.
  • Z. Wu, L. Cai, and H. Meng, “Multi-level fusion of audio and visual features for speaker identification,” in Advances in Biometrics (D. Zhang and A. Jain, eds.), pp. 493–499, Springer Berlin Heidelberg, 2005.
  • J. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
  • I. Gebru, S. Ba, X. Li, and R. Horaud, “Audio-visual speaker diarization based on spatiotemporal Bayesian fusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1086–1099, 2018.
  • J. Chung, B.-J. Lee, and I. Han, “Who said that?: Audio-visual speaker diarisation of real-world meetings,” in Proc. Interspeech, 2019.
  • J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, “Time domain audio visual speech separation,” in Proc. ASRU, 2019.
  • A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics, vol. 37, pp. 112:1–11, 2018.
  • T. Afouras, J. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech, 2018.
  • Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, 2013.
  • P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using clickthrough data,” in Proc. CIKM, 2013.
  • Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, “Learning semantic representations using convolutional neural networks for web search,” in Proc. WWW, 2014.
  • H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, pp. 694–707, 2016.
  • D. Rumelhart, G. Hinton, and R. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proc. ICLR, 2013.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, 2009.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F.-F. Li, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv:1412.3555, 2014.
  • J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proc. EMNLP, 2014.
  • A. Elkahky, Y. Song, and X. He, “A multi-view deep learning approach for cross domain user modeling in recommendation systems,” in Proc. WWW, 2015.
  • X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y.-Y. Wang, “Representation learning using multi-task deep neural networks for semantic classification and information retrieval,” in Proc. NAACL, 2015.
  • W.-T. Yih, X. He, and C. Meek, “Semantic parsing for single-relation question answering,” in Proc. ACL, 2014.
  • W.-T. Yih, M.-W. Chang, X. He, and J. Gao, “Semantic parsing via staged query graph generation: Question answering with knowledge base,” in Proc. ACL, 2015.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler, “Skip-thought vectors,” in Proc. NIPS, 2015.
  • T. Mikolov, W. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proc. NAACL-HLT, 2013.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014.
  • A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, “Multimodal deep learning,” in Proc. ICML, 2011.
  • N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Proc. NIPS, 2012.
  • C. Silberer and M. Lapata, “Learning grounded meaning representations with autoencoders,” in Proc. ACL, 2014.
  • H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., “From captions to visual concepts and back,” in Proc. CVPR, 2015.
  • E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran, “Distributional semantics in technicolor,” in Proc. ACL, 2012.
  • S. Kottur, R. Vedantam, J. Moura, and D. Parikh, “Visual Word2Vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes,” in Proc. CVPR, 2016.
  • X. Yang, P. Ramesh, R. Chitta, S. Madhvanath, E. Bernal, and J. Luo, “Deep multimodal representation learning from temporal data,” in Proc. CVPR, 2017.
  • P. Bachman, R. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Proc. NeurIPS, 2019.
  • A. Lazaridou, N. Pham, and M. Baroni, “Combining language and vision with a multimodal skip-gram model,” in Proc. NAACL, 2015.
  • A. Karpathy, A. Joulin, and F.-F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in Proc. NIPS, 2014.
  • H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W.-Y. Ma, “Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations,” in Proc. CVPR, 2019.
  • K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in Proc. ECCV, 2018.
  • Y.-H. Tsai, P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, “Learning factorized multimodal representations,” in Proc. ICLR, 2018.
  • T. Gupta, A. Schwing, and D. Hoiem, “ViCo: Word embeddings from visual co-occurrences,” in Proc. ICCV, 2019.
  • D.-K. Nguyen and T. Okatani, “Multi-task learning of hierarchical vision-language representation,” in Proc. CVPR, 2019.
  • R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Proc. NIPS, 2013.
  • A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Proc. NIPS, 2013.
  • Y.-H. Tsai, L.-K. Huang, and R. Salakhutdinov, “Learning robust visual-semantic embeddings,” in Proc. ICCV, 2017.
  • J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov, “Predicting deep zero-shot convolutional neural networks using textual descriptions,” in Proc. ICCV, 2015.
  • S. Reed, Z. Akata, B. Schiele, and H. Lee, “Learning deep representations of fine-grained visual descriptions,” in Proc. CVPR, 2016.
  • G. Collell and M.-F. Moens, “Do neural network cross-modal mappings really bridge modalities?,” in Proc. ACL, 2018.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017.
  • A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328–339, 1989.
  • J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,” in Proc. ICLR, 2017.
  • G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, “Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training,” arXiv:1908.06066, 2019.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: Pre-training of generic visual-linguistic representations,” arXiv:1908.08530, 2019.
  • L. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “VisualBERT: A simple and performant baseline for vision and language,” arXiv:1908.03557, 2019.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A joint model for video and language representation learning,” in Proc. ICCV, 2019.
  • C. Alberti, J. Ling, M. Collins, and D. Reitter, “Fusion of detected objects in text for visual question answering,” in Proc. EMNLP, 2019.
  • H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder representations from transformers,” in Proc. EMNLP, 2019.
  • J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Proc. NeurIPS, 2019.
  • S. Pramanik, P. Agrawal, and A. Hussain, “OmniNet: A unified architecture for multi-modal multi-task learning,” arXiv:1907.07804, 2019.
  • X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in Proc. ACL, 2019.
  • P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, R. Prasad, and P. Natarajan, “Multimodal feature fusion for robust event detection in web videos,” in Proc. CVPR, 2012.
  • K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. NIPS, 2014.
  • T. Wortwein and S. Scherer, “What really matters? An information gain analysis of questions and reactions in automated PTSD screenings,” in Proc. ACII, 2017.
  • G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang, “Robust late fusion with rank minimization,” in Proc. CVPR, 2012.
  • B. Nojavanasghari, D. Gopinath, J. Koushik, T. Baltrusaitis, and L.-P. Morency, “Deep multimodal fusion for persuasiveness prediction,” in Proc. ICMI, 2016.
  • H. Wang, A. Meghawat, L.-P. Morency, and E. Xing, “Select-additive learning: Improving generalization in multimodal sentiment analysis,” in Proc. ICME, 2017.
  • A. Anastasopoulos, S. Kumar, and H. Liao, “Neural language modeling with visual features,” arXiv:1903.02930, 2019.
  • V. Vielzeuf, A. Lechervy, S. Pateux, and F. Jurie, “CentralNet: A multilayer approach for multimodal fusion,” in Proc. ECCV, 2018.
  • B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus, “Simple baseline for visual question answering,” arXiv:1512.02167, 2015.
  • J.-M. Perez-Rua, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie, “MFAS: Multimodal fusion architecture search,” in Proc. CVPR, 2019.
  • B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in Proc. ICLR, 2017.
  • C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proc. ECCV, 2018.
  • J.-M. Perez-Rua, M. Baccouche, and S. Pateux, “Efficient progressive neural architecture search,” in Proc. BMVC, 2019.
  • X. Yang, P. Molchanov, and J. Kautz, “Multilayer and multimodal fusion of deep neural networks for video classification,” in Proc. ACM MM, 2016.
  • A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” arXiv:1410.5401, 2014.
  • Y. Zhu, O. Groth, M. Bernstein, and F.-F. Li, “Visual7W: Grounded question answering in images,” in Proc. CVPR, 2016.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML, 2015.
  • K. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in Proc. CVPR, 2016.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proc. CVPR, 2016.
  • H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in Proc. ECCV, 2016.
  • C. Xiong, S. Merity, and R. Socher, “Dynamic memory networks for visual and textual question answering,” in Proc. ICML, 2016.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. CVPR, 2018.
  • P. Lu, H. Li, W. Zhang, J. Wang, and X. Wang, “Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering,” in Proc. AAAI, 2018.
  • W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in Proc. CVPR, 2019.
  • J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Proc. NIPS, 2016.
  • H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. CVPR, 2017.
  • H. Fan and J. Zhou, “Stacked latent attention for multimodal reasoning,” in Proc. CVPR, 2018.
  • A. Osman and W. Samek, “DRAU: Dual recurrent attention units for visual question answering,” Computer Vision and Image Understanding, vol. 185, pp. 24–30, 2019.
  • I. Schwartz, A. Schwing, and T. Hazan, “High-order attention models for visual question answering,” in Proc. NIPS, 2017.
  • J. Arevalo, T. Solorio, M. Montes-y-Gomez, and F. Gonzalez, “Gated multimodal units for information fusion,” in Proc. ICLR, 2017.
  • J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang, “Multimodal residual learning for visual QA,” in Proc. NIPS, 2016.
  • H. Noh, P. Seo, and B. Han, “Image question answering using convolutional neural network with dynamic parameter prediction,” in Proc. CVPR, 2016.
  • J. Tenenbaum and W. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, pp. 1247–1283, 2000.
  • A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proc. EMNLP, 2017.
  • Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” in Proc. CVPR, 2016.
  • M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in Proc. ICALP, 2002.
  • N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in Proc. SIGKDD, 2013.
  • A. Fukui, D. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” in Proc. EMNLP, 2016.
  • J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in Proc. ICLR, 2017.
  • Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proc. ICCV, 2017.
  • Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, “Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, pp. 5947–5959, 2018.
  • L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, pp. 279–311, 1966.
  • H. Ben-younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal Tucker fusion for visual question answering,” in Proc. ICCV, 2017.
  • L. De Lathauwer, “Decompositions of a higher-order tensor in block terms, Part II: Definitions and uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 1033–1066, 2008.
  • H. Ben-younes, R. Cadene, N. Thome, and M. Cord, “BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection,” in Proc. AAAI, 2019.
  • Z. Liu, Y. Shen, V. Lakshminarasimhan, P. Liang, A. Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proc. ACL, 2018.
  • J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in Proc. NeurIPS, 2018.
  • J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in Proc. CVPR, 2018.
  • X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, and R. Feris, “Dialog-based interactive image retrieval,” in Proc. CVPR, 2018.
  • A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” in Proc. CVPR, 2018.
  • V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge, “Stay on the path: Instruction fidelity in vision-and-language navigation,” in Proc. ACL, 2019.
  • R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, and K. Saenko, “Are you looking? Grounding to multiple modalities in vision-and-language navigation,” in Proc. ACL, 2019.
  • H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, “TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments,” in Proc. CVPR, 2019.
  • L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, and S. Srinivasa, “Tactical rewind: Self-correction via backtracking in vision-and-language navigation,” in Proc. CVPR, 2019.
  • X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proc. CVPR, 2019.
  • C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong, “Self-monitoring navigation agent via auxiliary progress estimation,” in Proc. ICLR, 2019.
  • J. Fu, A. Korattikara, S. Levine, and S. Guadarrama, “From language to goals: Inverse reinforcement learning for vision-based instruction following,” in Proc. ICLR, 2019.
  • X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, 2017.
  • R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv:1411.2539, 2014.
  • J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv:1412.6632, 2014.
  • X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in Proc. CVPR, 2015.
  • J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proc. CVPR, 2017.
  • Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proc. CVPR, 2017.
  • A. Deshpande, J. Aneja, L. Wang, A. G. Schwing, and D. Forsyth, “Fast, diverse and accurate image captioning guided by part-of-speech,” in Proc. CVPR, 2019.
  • Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,” in Proc. ECCV, 2016.
  • K. Tran, X. He, L. Zhang, and J. Sun, “Rich image captioning in the wild,” in Proc. CVPR Workshop, 2016.
  • C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “StyleNet: Generating attractive visual captions with styles,” in Proc. CVPR, 2017.
  • D. Li, Q. Huang, X. He, L. Zhang, and M.-T. Sun, “Generating diverse and accurate visual captions by comparative adversarial learning,” arXiv:1804.00861, 2018.
  • A. Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850, 2013.
  • K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in Proc. ICML, 2015.
  • E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov, “Generating images from captions with attention,” in Proc. ICLR, 2016.
  • M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv:1411.1784, 2014.
  • E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a Laplacian pyramid of adversarial networks,” in Proc. NIPS, 2015.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Proc. NIPS, 2016.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. NIPS, 2017.
  • A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proc. ICML, 2017.
  • Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proc. CVPR, 2018.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proc. ICCV, 2017.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, pp. 1947–1962, 2019.
  • M. Zhu, P. Pan, W. Chen, and Y. Yang, “DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proc. CVPR, 2019.
  • A. Dash, J. Gamboa, S. Ahmed, M. Liwicki, and M. Afzal, “TAC-GAN: Text conditioned auxiliary classifier generative adversarial network,” in Proc. CVPR, 2017.
  • M. Cha, Y. Gwon, and H. Kung, “Adversarial learning of semantic relevance in text to image synthesis,” in Proc. AAAI, 2019.
  • X. Chen, M. Rohrbach, and D. Parikh, “Cycle-consistency for robust visual question answering,” in Proc. CVPR, 2019.
  • T. Qiao, J. Zhang, D. Xu, and D. Tao, “MirrorGAN: Learning text-to-image generation by redescription,” in Proc. CVPR, 2019.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.
  • M.-E. Nilsback and A. Zisserman, “A visual vocabulary for flower classification,” in Proc. CVPR, 2006.
  • T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Zitnick, and P. Dollar, “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014.
  • S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Proc. NIPS, 2016.
  • J. Johnson, A. Gupta, and F.-F. Li, “Image generation from scene graphs,” in Proc. CVPR, 2018.
  • S. Tripathi, A. Bhiwandiwalla, A. Bastidas, and H. Tang, “Heuristics for image generation from scene graphs,” in Proc. ICLR Workshop LLD, 2019.
  • B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from layout,” in Proc. CVPR, 2019.
  • T. Hinz, S. Heinrich, and S. Wermter, “Generating multiple objects at spatially distinct locations,” in Proc. ICLR, 2019.
  • S. Hong, D. Yang, J. Choi, and H. Lee, “Inferring semantic layout for hierarchical text-to-image synthesis,” in Proc. CVPR, 2018.
  • Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Transactions on Image Processing, vol. 28, pp. 5464–5478, 2019.
  • S. Nam, Y. Kim, and S. Kim, “Text-adaptive generative adversarial networks: Manipulating images with natural language,” in Proc. NeurIPS, 2018.
  • Q. Lao, M. Havaei, A. Pesaranghader, F. Dutil, L. Jorio, and T. Fevens, “Dual adversarial inference for text-to-image synthesis,” in Proc. ICCV, 2019.
  • F. Tan, S. Feng, and V. Ordonez, “Text2Scene: Generating compositional scenes from textual descriptions,” in Proc. CVPR, 2019.
  • S. Sharma, D. Suhubdy, V. Michalski, S. Kahou, and Y. Bengio, “ChatPainter: Improving text to image generation using dialogue,” in Proc. ICLR Workshop, 2018.
  • A. El-Nouby, S. Sharma, H. Schulz, D. Hjelm, L. Asri, S. Kahou, Y. Bengio, and G. Taylor, “Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction,” in Proc. ICCV, 2019.
  • P. Cascante-Bonilla, X. Yin, V. Ordonez, and S. Feng, “Chat-crowd: A dialog-based platform for visual layout composition,” in Proc. NAACL-HLT, 2018.
  • Y. Chen, Z. Gan, Y. Li, J. Liu, and J. Gao, “Sequential attention GAN for interactive image editing via dialogue,” in Proc. AAAI, 2019.
  • J.-H. Kim, N. Kitaev, X. Chen, M. Rohrbach, B.-T. Zhang, Y. Tian, D. Batra, and D. Parikh, “CoDraw: Collaborative drawing as a testbed for grounded goal-driven communication,” in Proc. ACL, 2019.
  • Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao, “StoryGAN: A sequential conditional GAN for story visualization,” in Proc. CVPR, 2019.
  • Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, “Video generation from text,” in Proc. AAAI, 2018.
  • Y. Balaji, M. Min, B. Bai, R. Chellappa, and H. Graf, “Conditional GAN with discriminative filter generation for text-to-video synthesis,” in Proc. IJCAI, 2019.
  • M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” in Proc. NIPS, 2014.
  • M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A neural-based approach to answering questions about images,” in Proc. ICCV, 2015.
  • M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” in Proc. NIPS, 2015.
  • Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel, “Are you talking to me? Reasoned visual dialog generation through adversarial learning,” in Proc. CVPR, 2018.
  • U. Jain, Z. Zhang, and A. Schwing, “Creativity: Generating diverse questions using variational autoencoders,” in Proc. CVPR, 2017.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. Moura, D. Parikh, and D. Batra, “Visual dialog,” in Proc. CVPR, 2017.
  • H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville, “GuessWhat?! Visual object discovery through multimodal dialogue,” in Proc. CVPR, 2017.
  • Y. Goyal, T. Khot, A. Agrawal, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” International Journal of Computer Vision, vol. 127, pp. 398–414, 2019.
  • P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, “Explicit knowledge-based reasoning for visual question answering,” in Proc. IJCAI, 2017.
  • P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, “FVQA: Fact-based visual question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2413–2427, 2018.
  • K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “OK-VQA: A visual question answering benchmark requiring external knowledge,” in Proc. CVPR, 2019.
  • D. Hudson and C. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in Proc. CVPR, 2019.
  • A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, “Don’t just assume; look and answer: Overcoming priors for visual question answering,” in Proc. CVPR, 2018.
  • S. Ramakrishnan, A. Agrawal, and S. Lee, “Overcoming language priors in visual question answering with adversarial regularization,” in Proc. NeurIPS, 2018.
  • R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh, “RUBi: Reducing unimodal biases in visual question answering,” in Proc. NeurIPS, 2019.
  • J.-Y. Zhu, T. Park, P. Isola, and A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017.
  • Y. Zhang, J. Hare, and A. Prugel-Bennett, “Learning to count objects in natural images for visual question answering,” in Proc. ICLR, 2018.
  • A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards VQA models that can read,” in Proc. CVPR, 2019.
  • D. Gurari, Q. Li, A. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. Bigham, “VizWiz grand challenge: Answering visual questions from blind people,” in Proc. CVPR, 2018.
  • E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proc. AAAI, 2018.
  • R. Cadene, H. Ben-younes, M. Cord, and N. Thome, “MUREL: Multimodal relational reasoning for visual question answering,” in Proc. CVPR, 2019.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” in Proc. CVPR, 2016.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” in Proc. NAACL, 2016.
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in Proc. ICCV, 2017.
  • J. Johnson, B. Hariharan, L. van der Maaten, F.-F. Li, C. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proc. CVPR, 2017.
  • J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, F.-F. Li, C. Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in Proc. ICCV, 2017.
  • R. Hu, J. Andreas, T. Darrell, and K. Saenko, “Explainable neural computation via stack neural module networks,” in Proc. ECCV, 2018.
  • D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, “Transparency by design: Closing the gap between performance and interpretability in visual reasoning,” in Proc. CVPR, 2018.
  • D. Hudson and C. Manning, “Compositional attention networks for machine reasoning,” in Proc. ICLR, 2018.
  • K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum, “Neural-symbolic VQA: Disentangling reasoning from vision and language understanding,” in Proc. NeurIPS, 2018.
  • R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra, and D. Parikh, “Probabilistic neural-symbolic models for interpretable visual question answering,” in Proc. ICML, 2019.
  • J. Mao, C. Gan, P. Kohli, J. Tenenbaum, and J. Wu, “The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision,” in Proc. ICLR, 2019.
  • A. Santoro, D. Raposo, D. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Proc. NIPS, 2017.
  • Li Deng was an elected member of the Board of Governors of the IEEE Signal Processing Society, and was Editor-in-Chief of IEEE Signal Processing Magazine and of IEEE/ACM Transactions on Audio, Speech, and Language Processing (2008-2014), for which he received the IEEE SPS Meritorious Service Award. In recognition of his pioneering work on disrupting the speech recognition industry using large-scale deep learning, he received the 2015 IEEE SPS Technical Achievement Award for Outstanding Contributions to Automatic Speech Recognition and to Deep Learning. He has also received dozens of best paper and patent awards for contributions to artificial intelligence, machine learning, information retrieval, multimedia signal processing, speech processing and recognition, and human language technology. He is an author or co-author of six technical books on deep learning, speech processing, pattern recognition, machine learning, and, most recently, natural language processing (Springer, June 2018).
  • Chao Zhang is an advisor of the JD.com speech team and a research associate in speech and natural language processing at the University of Cambridge. He received his B.E. and M.S. degrees in 2009 and 2012, respectively, both from the Department of Computer Science & Technology, Tsinghua University, and a Ph.D. degree in 2017 from the Cambridge University Engineering Department.
  • Xiaodong He (IEEE Member 2003, Senior Member 2008, Fellow 2019) is the Deputy Managing Director of JD AI Research and Head of the Deep Learning, NLP and Speech Lab. He is also an Affiliate Professor of ECE at the University of Washington (Seattle). His research interests are mainly in deep learning, natural language processing, speech recognition, computer vision, information retrieval, and multimodal intelligence. He has held editorial positions on multiple IEEE journals and the Transactions of the ACL, and has served on the organizing and program committees of major speech and language processing conferences. He was a member of the IEEE SLTC for the 2015-2017 term and the Chair of the IEEE Seattle Section in 2016. He received a bachelor's degree from Tsinghua University in 1996, an M.S. degree from the Chinese Academy of Sciences in 1999, and a Ph.D. degree from the University of Missouri-Columbia in 2003.