Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
IEEE Journal of Selected Topics in Signal Processing, pp. 478-493, 2019.
Abstract:
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities.
Introduction
- Significant progress has been made in the field of machine learning in recent years due to the rapid development of deep learning [1]–[6].
Highlights
- Significant progress has been made in the field of machine learning in recent years due to the rapid development of deep learning [1]–[6]
- This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles — learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications
- We review the key concept of embedding, which unifies multimodal signals into the same vector space and thereby enables cross-modality signal processing (a minimal sketch follows this list)
- Dating back to the dramatic increase in the accuracy of large-scale automatic speech recognition (ASR) using fully connected deep neural networks (DNNs) and deep auto-encoders around 2010 [7]–[17], and followed by a set of breakthroughs in computer vision (CV) using deep convolutional neural network (CNN) models [18] for large-scale image classification around 2012 [19]–[22] and large-scale object detection [23]–[25] around 2014, a series of major milestones have been achieved in pattern recognition with a single input modality
- This paper reviews deep learning based modeling and machine learning across multiple modalities, focusing on the combination of vision and natural language
- In the section on representations, both single-modal and multimodal representations are reviewed under the key concept of embedding
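To make the shared-embedding idea concrete, below is a minimal sketch, not taken from the paper, of projecting separately extracted image and text features into a common vector space and comparing them with cosine similarity; the feature dimensions, the projection matrices, and the embed helper are hypothetical placeholders for what a trained joint-embedding model would provide.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical pre-extracted unimodal features (e.g., a CNN image feature
    # vector and an averaged word-embedding vector); dimensions are arbitrary.
    image_feat = rng.standard_normal(2048)
    text_feat = rng.standard_normal(300)

    # Hypothetical linear projections into a shared 256-dimensional space.
    # In practice these would be trained with a ranking or contrastive loss
    # so that matching image-text pairs land close together.
    W_img = 0.01 * rng.standard_normal((256, 2048))
    W_txt = 0.01 * rng.standard_normal((256, 300))

    def embed(x, W):
        # Project a unimodal feature into the shared space and L2-normalize it.
        z = W @ x
        return z / np.linalg.norm(z)

    img_emb = embed(image_feat, W_img)
    txt_emb = embed(text_feat, W_txt)

    # Cosine similarity in the shared space supports cross-modal processing,
    # e.g., ranking candidate captions for an image or retrieving images by text.
    similarity = float(img_emb @ txt_emb)
    print(f"cross-modal cosine similarity: {similarity:.3f}")

Joint-embedding models reviewed in the paper (e.g., DeViSE-style visual-semantic embeddings) learn such projections from paired data; this sketch only illustrates the inference-time geometry.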
Conclusion
- This paper reviews deep learning based modeling and machine learning across multiple modalities, focusing on the combination of vision and natural language.
- The authors propose to organize the many pieces of work in the language-vision multimodal intelligence field from three aspects: multimodal representations, the fusion of multimodal signals (illustrated in the sketch after this list), and the applications of multimodal intelligence.
- Three selected areas of broad interest are presented, which include image caption generation, text-to-image synthesis, and visual question answering.
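As a rough illustration of what fusion at different levels means, and not the paper's own models, the sketch below contrasts early (feature-level) fusion, which concatenates unimodal features before a single predictor, with late (decision-level) fusion, which combines the outputs of per-modality predictors; all weights and dimensions are made-up placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical unimodal feature vectors for a single example.
    audio_feat = rng.standard_normal(40)
    visual_feat = rng.standard_normal(128)

    # Early (feature-level) fusion: concatenate features, then one classifier.
    w_early = 0.05 * rng.standard_normal(40 + 128)
    early_score = sigmoid(w_early @ np.concatenate([audio_feat, visual_feat]))

    # Late (decision-level) fusion: one classifier per modality, then combine scores.
    w_audio = 0.05 * rng.standard_normal(40)
    w_visual = 0.05 * rng.standard_normal(128)
    audio_score = sigmoid(w_audio @ audio_feat)
    visual_score = sigmoid(w_visual @ visual_feat)
    late_score = 0.5 * audio_score + 0.5 * visual_score  # fixed average; weights could be learned

    print(f"early-fusion score: {early_score:.3f}, late-fusion score: {late_score:.3f}")

The attention-based and bilinear-pooling fusion methods surveyed in the paper operate between these two extremes, combining modality-specific representations inside the network rather than at the raw-feature or final-decision stage.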
References
- G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006.
- Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, pp. 1–127, 2009.
- L. Deng and D. Yu, “Deep Learning: Methods and Applications,” Foundations and Trends in Signal Processing, vol. 7, pp. 197–387, 2014.
- J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
- D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” Proc. NIPS Workshop, 2010.
- L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep autoencoder,” Proc. Interspeech, 2010.
- L. Deng, “An overview of deep-structured learning for information processing,” in Proc. APSIPA ASC, 2011.
- D. Yu, L. Deng, F. Seide, and G. Li, “Discriminative pre-training of deep neural networks,” in U.S. Patent No. 9,235,799, 2011.
- G. Dahl, D. Yu, and L. Deng, “Large-vocabulary continuous speech recognition with context-dependent DBN-HMMs,” in Proc. ICASSP, 2011.
- L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero, “Recent advances in deep learning for speech research at Microsoft,” in Proc. ICASSP, 2013.
- G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 30–42, 2012.
- F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, pp. 82–97, 2012.
- L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” Proc. ICASSP, 2013.
- D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. Springer, 2015.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
- A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. CVPR, 2014.
- R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards realtime object detection with region proposal networks,” in Proc. NIPS, 2015.
- G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, pp. 530–539, 2015.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
- I. Sutskever, O. Vinyals, and Q. Le, “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014.
- Y. Wu, M. Schuster, Z. Chen, Q. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” in arXiv:1609.08144, 2016.
- M.-T. Luong, H. Pham, and C. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015.
- M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. NAACL-HLT, 2018.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” in https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019.
- H.-Y. Shum, X. He, and D. Li, “From Eliza to XiaoIce: Challenges and opportunities with social chatbots,” Frontiers of Information Technology & Electronic Engineering, vol. 19, pp. 10–19, 2018.
- S. Bengio, L. Deng, L. Morency, and B. Schuller, Perspectives on Predictive Power of Multimodal Deep Learning: Surprises and Future Directions. Chapter 14 in Book: The Handbook of Multimodal-Multisensor Interfaces. ACM and Morgan & Claypool Publishers, 2019.
- L. Deng and Y. Liu, Deep Learning in Natural Language Processing. Springer, 2018.
- S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “ReferItGame: Referring to objects in photographs of natural scenes,” in Proc. EMNLP, 2014.
- L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg, “Modeling context in referring expressions,” in Proc. ECCV, 2016.
- B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proc. ICCV, 2015.
- A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. CVPR, 2015.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR, 2015.
- J. Johnson, A. Karpathy, and F.-F. Li, “Densecap: Fully convolutional localization networks for dense captioning,” in Proc. CVPR, 2016.
- D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual Turing test for computer vision systems,” in Proc. NAS, 2015.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proc. ICCV, 2015.
- L. Yu, E. Park, A. Berg, and T. Berg, “Visual Madlibs: Fill in the blank description generation and question answering,” in Proc. ICCV, 2015.
- X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image: Conditional image generation from visual attributes,” in Proc. ECCV, 2016.
- S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. ICML, 2016.
- T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in Proc. CVPR, 2018.
- P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. CVPR, 2018.
- S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Transactions on Multimedia, vol. 2, pp. 141–151, 2000.
- M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” Journal of the Acoustical Society of America, vol. 120, pp. 2421–2424, 2006.
- T. Afouras, J. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. Early Access, pp. 1–13, 2018.
- B. Maison, C. Neti, and A. Senior, “Audio-visual speaker recognition for video broadcast news: Some fusion techniques,” in Proc. MMSP, 1999.
- Z. Wu, L. Cai, and H. Meng, “Multi-level fusion of audio and visual features for speaker identification,” in Advances in Biometrics (D. Zhang and A. Jain, eds.), pp. 493–499, Springer Berlin Heidelberg, 2005.
- J. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
- I. Gebru, S. Ba, X. Li, and R. Horaud, “Audio-visual speaker diarization based on spatiotemporal Bayesian fusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 1086–1099, 2018.
- J. Chung, B.-J. Lee, and I. Han, “Who said that?: Audio-visual speaker diarisation of real-world meetings,” in Proc. Interspeech, 2019.
- J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, “Time domain audio visual speech separation,” in Proc. ASRU, 2019.
- A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics, vol. 37, pp. 112:1–11, 2018.
- T. Afouras, J. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech, 2018.
- Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, 2013.
- P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using clickthrough data,” in Proc. CIKM, 2013.
- Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, “Learning semantic representations using convolutional neural networks for web search,” in Proc. WWW, 2014.
- H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, pp. 694–707, 2016.
- D. Rumelhart, G. Hinton, and R. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
- Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proc. ICLR, 2013.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013.
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, 2009.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and F.-F. Li, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
- J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in arXiv:1412.3555, 2014.
- J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proc. EMNLP, 2014.
- A. Elkahky, Y. Song, and X. He, “A multi-view deep learning approach for cross domain user modeling in recommendation systems,” in Proc. WWW, 2015.
- X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y.-Y. Wang, “Representation learning using multi-task deep neural networks for semantic classification and information retrieval,” in Proc. NAACL, 2015.
- W.-T. Yih, X. He, and C. Meek, “Semantic parsing for single-relation question answering,” in Proc. ACL, 2014.
- W.-T. Yih, M.-W. Chang, X. He, and J. Gao, “Semantic parsing via staged query graph generation: Question answering with knowledge base,” in Proc. ACL, 2015.
- R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler, “Skip-thought vectors,” in Proc. NIPS, 2015.
- T. Mikolov, W. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proc. NAACL HLT, 2013.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014.
- A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. ICLR, 2016.
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, “Multimodal deep learning,” in Proc. ICML, 2011.
- N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Proc. NIPS, 2012.
- C. Silberer and M. Lapata, “Learning grounded meaning representations with autoencoders,” in Proc. ACL, 2014.
- H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., “From captions to visual concepts and back,” in Proc. CVPR, 2015.
- E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran, “Distributional semantics in technicolor,” in Proc. ACL, 2012.
- S. Kottur, R. Vedantam, J. Moura, and D. Parikh, “Visual Word2Vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes,” in Proc. CVPR, 2016.
- X. Yang, P. Ramesh, R. Chitta, S. Madhvanath, E. Bernal, and J. Luo, “Deep multimodal representation learning from temporal data,” in Proc. CVPR, 2017.
- P. Bachman, R. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Proc. NeurIPS, 2019.
- A. Lazaridou, N. Pham, and M. Baroni, “Combining language and vision with a multimodal skip-gram model,” in Proc. NAACL, 2015.
- A. Karpathy, A. Joulin, and F.-F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in Proc. NIPS, 2014.
- H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W.-Y. Ma, “Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations,” in Proc. CVPR, 2019.
- K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in Proc. ECCV, 2018.
- Y.-H. Tsai, P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, “Learning factorized multimodal representations,” in Proc. ICLR, 2018.
- T. Gupta, A. Schwing, and D. Hoiem, “ViCo: Word embeddings from visual co-occurrences,” in Proc. ICCV, 2019.
- D.-K. Nguyen and T. Okatani, “Multi-task learning of hierarchical vision-language representation,” in Proc. CVPR, 2019.
- R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Proc. NIPS, 2013.
- A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Proc. NIPS, 2013.
- Y.-H. Tsai, L.-K. Huang, and R. Salakhutdinov, “Learning robust visual-semantic embeddings,” in Proc. ICCV, 2017.
- J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov, “Predicting deep zero-shot convolutional neural networks using textual descriptions,” in Proc. ICCV, 2015.
- S. Reed, Z. Akata, B. Schiele, and H. Lee, “Learning deep representations of fine-grained visual descriptions,” in Proc. CVPR, 2016.
- G. Collell and M.-F. Moens, “Do neural network cross-modal mappings really bridge modalities?,” in Proc. ACL, 2018.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017.
- A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328–339, 1989.
- J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,” in Proc. ICLR, 2017.
- G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, “Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training,” in arXiv:1908.06066, 2019.
- W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: Pre-training of generic visual-linguistic representations,” in arXiv:1908.08530, 2019.
- L. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “VisualBERT: A simple and performant baseline for vision and language,” in arXiv:1908.03557, 2019.
- C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A joint model for video and language representation learning,” in Proc. ICCV, 2019.
- C. Alberti, J. Ling, M. Collins, and D. Reitter, “Fusion of detected objects in text for visual question answering,” in Proc. ICMLC, 2019.
- H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder representations from transformers,” in Proc. EMNLP, 2019.
- J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Proc. NeurIPS, 2019.
- S. Pramanik, P. Agrawal, and A. Hussain, “OmniNet: A unified architecture for multi-modal multi-task learning,” in arXiv:1907.07804, 2019.
- X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in Proc. ACL, 2019.
- P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, R. Prasad, and P. Natarajan, “Multimodal feature fusion for robust event detection in web videos,” in Proc. CVPR, 2012.
- K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. NIPS, 2014.
- T. Wortwein and S. Scherer, “What really matters – an information gain analysis of questions and reactions in automated PTSD screenings,” in Proc. ACII, 2017.
- G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang, “Robust late fusion with rank minimization,” in Proc. CVPR, 2012.
- B. Nojavanasghari, D. Gopinath, J. Koushik, T. Baltrusaitis, and L.-P. Morency, “Deep multimodal fusion for persuasiveness prediction,” in Proc. ICMI, 2016.
- H. Wang, A. Meghawat, L.-P. Morency, and E. Xing, “Select-additive learning: Improving generalization in multimodal sentiment analysis,” in Proc. ICME, 2017.
- A. Anastasopoulos, S. Kumar, and H. Liao, “Neural language modeling with visual features,” in arXiv:1903.02930, 2019.
- V. Vielzeuf, A. Lechervy, S. Pateux, and F. Jurie, “CentralNet: A multilayer approach for multimodal fusion,” in Proc. ECCV, 2018.
- B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus, “Simple baseline for visual question answering,” in arXiv:1512.02167, 2015.
- J.-M. Perez-Rua, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie, “MFAS: Multimodal fusion architecture search,” in Proc. CVPR, 2019.
- B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in Proc. ICLR, 2017.
- C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proc. ECCV, 2018.
- J.-M. Perez-Rua, M. Baccouche, and S. Pateux, “Efficient progressive neural architecture search,” in Proc. BMVC, 2019.
- X. Yang, P. Molchanov, and J. Kautz, “Multilayer and multimodal fusion of deep neural networks for video classification,” in Proc. ACM MM, 2016.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
- A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” in arXiv:1410.5401, 2014.
- Y. Zhu, O. Groth, M. Bernstein, and F.-F. Li, “Visual7W: Grounded question answering in images,” in Proc. CVPR, 2016.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML, 2015.
- K. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in Proc. CVPR, 2016.
- Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proc. CVPR, 2016.
- H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in Proc. ECCV, 2016.
- C. Xiong, S. Merity, and R. Socher, “Dynamic memory networks for visual and textual question answering,” in Proc. ICML, 2016.
- P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. CVPR, 2018.
- P. Lu, H. Li, W. Zhang, J. Wang, and X. Wang, “Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering,” in Proc. AAAI, 2018.
- W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in Proc. CVPR, 2019.
- J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Proc. NIPS, 2016.
- H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. CVPR, 2017.
- H. Fan and J. Zhou, “Stacked latent attention for multimodal reasoning,” in Proc. CVPR, 2018.
- A. Osman and W. Samek, “DRAU: Dual recurrent attention units for visual question answering,” Computer Vision and Image Understanding, vol. 185, pp. 24–30, 2019.
- I. Schwartz, A. Schwing, and T. Hazan, “High-order attention models for visual question answering,” in Proc. NIPS, 2017.
- J. Arevalo, T. Solorio, M. Montes-y Gomez, and F. Gonzalez, “Gated multimodal units for information fusion,” in Proc. ICLR, 2017.
- J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang, “Multimodal residual learning for visual QA,” in Proc. NIPS, 2016.
- H. Noh, P. Seo, and B. Han, “Image question answering using convolutional neural network with dynamic parameter prediction,” in Proc. CVPR, 2016.
- J. Tenenbaum and W. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, pp. 1247–1283, 2000.
- A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proc. EMNLP, 2017.
- Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” in Proc. CVPR, 2016.
- M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in Proc. ICALP, 2002.
- N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in Proc. SIGKDD, 2013.
- A. Fukui, D. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” in Proc. EMNLP, 2016.
- J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in Proc. ICLR, 2017.
- Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proc. ICCV, 2017.
- Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, “Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, pp. 5947–5959, 2018.
- L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, pp. 279–311, 1966.
- H. Ben-younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal tucker fusion for visual question answering,” in Proc. ICCV, 2017.
- L. De Lathauwer, “Decompositions of a higher-order tensor in block terms – Part II: Definitions and uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 1033–1066, 2008.
- H. Ben-younes, R. Cadene, N. Thome, and M. Cord, “BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection,” in Proc. AAAI, 2019.
- Z. Liu, Y. Shen, V. Lakshminarasimhan, P. Liang, A. Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proc. ACL, 2018.
- J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in Proc. NeurIPS, 2018.
- J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in Proc. CVPR, 2018.
- X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, and R. Feris, “Dialogbased interactive image retrieval,” in Proc. CVPR, 2018.
- A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” in Proc. CVPR, 2018.
- P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, and N. Sunderhauf, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. CVPR, 2018.
- V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge, “Stay on the path: Instruction fidelity in vision-and-language navigation,” in Proc. ACL, 2019.
- R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, and K. Saenko, “Are you looking? Grounding to multiple modalities in vision-and-language navigation,” in Proc. ACL, 2019.
- H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, “TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments,” in Proc. CVPR, 2019.
- L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, and S. Srinivasa, “Tactical rewind: Self-correction via backtracking in vision-and-language navigation,” in Proc. CVPR, 2019.
- X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Wang, and L. Zhang, “Reinforced cross-modal matching and selfsupervised imitation learning for vision-language navigation,” in Proc. CVPR, 2019.
- C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong, “Self-monitoring navigation agent via auxiliary progress estimation,” in Proc. ICLR, 2019.
- J. Fu, A. Korattikara, S. Levine, and S. Guadarrama, “From language to goals: Inverse reinforcement learning for vision-based instruction following,” in Proc. ICLR, 2019.
- X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, 2017.
- R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR, 2015.
- J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv preprint arXiv:1412.6632, 2014.
- X. Chen and C. Lawrence Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in Proc. CVPR, 2015.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML, 2015.
- J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proc. CVPR, 2017.
- Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proc. CVPR, 2017.
- A. Deshpande, J. Aneja, L. Wang, A. G. Schwing, and D. Forsyth, “Fast, diverse and accurate image captioning guided by part-of-speech,” in Proc. CVPR, 2019.
- Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,” in Proc. ECCV, 2016.
- K. Tran, X. He, L. Zhang, and J. Sun, “Rich image captioning in the wild,” in Proc. CVPR Workshop, 2016.
- C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “StyleNet: Generating attractive visual captions with styles,” in Proc. CVPR, 2017.
- D. Li, Q. Huang, X. He, L. Zhang, and M.-T. Sun, “Generating diverse and accurate visual captions by comparative adversarial learning,” in arXiv:1804.00861, 2018.
- A. Graves, “Generating sequences with recurrent neural networks,” in arXiv:1308.0850, 2013.
- K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in Proc. ICML, 2015.
- E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov, “Generating images from captions with attention,” in Proc. ICLR, 2016.
- M. Mirza and S. Osindero, “Conditional generative adversarial nets,” in arXiv:1411.1784, 2014.
- E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Proc. NIPS, 2015.
- S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. ICML, 2016.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Proc. NIPS, 2016.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. NIPS, 2017.
- A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proc. ICML, 2017.
- Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proc. CVPR, 2018.
- H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proc. ICCV, 2017.
- H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, pp. 1947–1962, 2019.
- M. Zhu, P. Pan, W. Chen, and Y. Yang, “DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proc. CVPR, 2019.
- A. Dash, J. Gamboa, S. Ahmed, M. Liwicki, and M. Afzal, “TAC-GAN – Text conditioned auxiliary classifier generative adversarial network,” in Proc. CVPR, 2017.
- M. Cha, Y. Gwon, and H. Kung, “Adversarial learning of semantic relevance in text to image synthesis,” in Proc. AAAI, 2019.
- X. Chen, M. Rohrbach, and D. Parikh, “Cycle-consistency for robust visual question answering,” in Proc. CVPR, 2019.
- T. Qiao, J. Zhang, D. Xu, and D. Tao, “MirrorGAN: Learning text-to-image generation by redescription,” in Proc. CVPR, 2019.
- P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD birds 200,” Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.
- M.-E. Nilsback and A. Zisserman, “A visual vocabulary for flower classification,” in Proc. CVPR, 2006.
- T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Zitnick, and P. Dollar, “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014.
- S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Proc. NIPS, 2016.
- J. Johnson, A. Gupta, and F.-F. Li, “Image generation from scene graphs,” in Proc. CVPR, 2018.
- S. Tripathi, A. Bhiwandiwalla, A. Bastidas, and H. Tang, “Heuristics for image generation from scene graphs,” in Proc. ICLR Workshop LLD, 2019.
- B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from layout,” in Proc. CVPR, 2019.
- T. Hinz, S. Heinrich, and S. Wermter, “Generating multiple objects at spatially distinct locations,” in Proc. ICLR, 2019.
- S. Hong, D. Yang, J. Choi, and H. Lee, “Inferring semantic layout for hierarchical text-to-image synthesis,” in Proc. CVPR, 2018.
- X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image: Conditional image generation from visual attributes,” in Proc. ECCV, 2016.
- Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Transactions on Image Processing, vol. 28, pp. 5464–5478, 2019.
- S. Nam, Y. Kim, and S. Kim, “Text-adaptive generative adversarial networks: Manipulating images with natural language,” in Proc. NeurIPS, 2018.
- Q. Lao, M. Havaei, A. Pesaranghader, F. Dutil, L. Jorio, and T. Fevens, “Dual adversarial inference for text-to-image synthesis,” in Proc. ICCV, 2019.
- F. Tan, S. Feng, and V. Ordonez, “Text2Scene: Generating compositional scenes from textual descriptions,” in Proc. CVPR, 2019.
- S. Sharma, D. Suhubdy, V. Michalski, S. Kahou, and Y. Bengio, “ChatPainter: Improving text to image generation using dialogue,” in Proc. ICLR Workshop, 2018.
- A. El-Nouby, S. Sharma, H. Schulz, D. Hjelm, L. Asri, S. Kahou, Y. Bengio, and G. Taylor, “Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction,” in Proc. ICCV, 2019.
- P. Cascante-Bonilla, X. Yin, V. Ordonez, and S. Feng, “Chat-crowd: A dialog-based platform for visual layout composition,” in Proc. NAACL-HLT, 2018.
- Y. Chen, Z. Gan, Y. Li, J. Liu, and J. Gao, “Sequential attention GAN for interactive image editing via dialogue,” in Proc. AAAI, 2019.
- J.-H. Kim, N. Kitaev, X. Chen, M. Rohrbach, B.-T. Zhang, Y. Tian, D. Batra, and D. Parikh, “CoDraw: Collaborative drawing as a testbed for grounded goal-driven communication,” in Proc. ACL, 2019.
- Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao, “StoryGAN: A sequential conditional GAN for story visualization,” in Proc. CVPR, 2019.
- Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, “Video generation from text,” in Proc. AAAI, 2018.
- Y. Balaji, M. Min, B. Bai, R. Chellappa, and H. Graf, “Conditional GAN with discriminative filter generation for text-to-video synthesis,” in Proc. IJCAI, 2019.
- M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” in Proc. NIPS, 2014.
- M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A neural-based approach to answering questions about images,” in Proc. ICCV, 2015.
- M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” in Proc. NIPS, 2015.
- Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel, “Are you talking to me? Reasoned visual dialog generation through adversarial learning,” in Proc. CVPR, 2018.
- U. Jain, Z. Zhang, and A. Schwing, “Creativity: Generating diverse questions using variational autoencoders,” in Proc. CVPR, 2017.
- A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. Moura, D. Parikh, and D. Batra, “Visual dialog,” in Proc. CVPR, 2017.
- H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville, “GuessWhat?! Visual object discovery through multimodal dialogue,” in Proc. CVPR, 2017.
- Y. Goyal, T. Khot, A. Agrawal, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” International Journal of Computer Vision, vol. 127, pp. 398–414, 2019.
- P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, “Explicit knowledge-based reasoning for visual question answering,” in Proc. IJCAI, 2017.
- P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, “FVQA: Fact-based visual question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2413–2427, 2018.
- K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “OK-VQA: A visual question answering benchmark requiring external knowledge,” in Proc. CVPR, 2019.
- D. Hudson and C. Manning, “GQA: A new dataset for real-world visual reasoning and compositional question answering,” in Proc. CVPR, 2019.
- A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, “Don't just assume; look and answer: Overcoming priors for visual question answering,” in Proc. CVPR, 2018.
- S. Ramakrishnan, A. Agrawal, and S. Lee, “Overcoming language priors in visual question answering with adversarial regularization,” in Proc. NeurIPS, 2018.
- R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh, “RUBi: Reducing unimodal biases in visual question answering,” in Proc. NeurIPS, 2019.
- J.-Y. Zhu, T. Park, P. Isola, and A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017.
- Y. Zhang, J. Hare, and A. Prugel-Bennett, “Learning to count objects in natural images for visual question answering,” in Proc. ICLR, 2018.
- A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards VQA models that can read,” in Proc. CVPR, 2019.
- D. Gurari, Q. Li, A. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. Bigham, “VizWiz grand challenge: Answering visual questions from blind people,” in Proc. CVPR, 2018.
- E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proc. AAAI, 2018.
- R. Cadene, H. Ben-younes, M. Cord, and N. Thome, “MUREL: Multimodal relational reasoning for visual question answering,” in Proc. CVPR, 2019.
- J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” in Proc. CVPR, 2016.
- J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” in Proc. NAACL, 2016.
- R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in Proc. ICCV, 2017.
- J. Johnson, B. Hariharan, L. van der Maaten, F.-F. Li, C. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proc. CVPR, 2017.
- J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, F.-F. Li, C. Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in Proc. ICCV, 2017.
- R. Hu, J. Andreas, T. Darrell, and K. Saenko, “Explainable neural computation via stack neural module networks,” in Proc. ECCV, 2018.
- D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, “Transparency by design: Closing the gap between performance and interpretability in visual reasoning,” in Proc. CVPR, 2018.
- D. Hudson and C. Manning, “Compositional attention networks for machine reasoning,” in Proc. ICLR, 2018.
- K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum, “Neural-symbolic VQA: Disentangling reasoning from vision and language understanding,” in Proc. NeurIPS, 2018.
- R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra, and D. Parikh, “Probabilistic neural-symbolic models for interpretable visual question answering,” in Proc. ICML, 2018.
- J. Mao, C. Gan, P. Kohli, J. Tenenbaum, and J. Wu, “The neurosymbolic concept learner: Interpreting scenes, words, and sentences from natural supervision,” in Proc. ICLR, 2019.
- A. Santoro, D. Raposo, D. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Proc. NIPS, 2017.
- Li Deng was an elected member of the Board of Governors of the IEEE Signal Processing Society, and served as Editor-in-Chief of IEEE Signal Processing Magazine and of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2008-2014), for which he received the IEEE SPS Meritorious Service Award. In recognition of his pioneering work on disrupting the speech recognition industry using large-scale deep learning, he received the 2015 IEEE SPS Technical Achievement Award for Outstanding Contributions to Automatic Speech Recognition and to Deep Learning. He has also received dozens of best paper and patent awards for contributions to artificial intelligence, machine learning, information retrieval, multimedia signal processing, speech processing and recognition, and human language technology. He is an author or co-author of six technical books on deep learning, speech processing, pattern recognition and machine learning, and, most recently, natural language processing (Springer, June 2018).
- Chao Zhang is an advisor to the JD.com speech team and a research associate in speech and natural language processing at the University of Cambridge. He received his B.E. and M.S. degrees in 2009 and 2012, respectively, both from the Department of Computer Science & Technology, Tsinghua University, and his Ph.D. degree in 2017 from the Cambridge University Engineering Department.
- Xiaodong He (IEEE Member 2003, Senior Member 2008, Fellow 2019) is the Deputy Managing Director of JD AI Research and Head of the Deep Learning, NLP and Speech Lab. He is also an Affiliate Professor of ECE at the University of Washington (Seattle). His research interests are mainly in deep learning, natural language processing, speech recognition, computer vision, information retrieval, and multimodal intelligence. He has held editorial positions on multiple IEEE journals and the Transactions of the ACL, and has served on the organizing and program committees of major speech and language processing conferences. He was a member of the IEEE SLTC for the 2015-2017 term and the Chair of the IEEE Seattle Section in 2016. He received his bachelor's degree from Tsinghua University in 1996, his M.S. degree from the Chinese Academy of Sciences in 1999, and his Ph.D. degree from the University of Missouri – Columbia in 2003.