Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions

ICMI '23 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction (2023)

Abstract

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents capable of understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in healthcare and robotics, multimodality has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this tutorial is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. Building upon a new edition of our survey paper on multimodal ML and academic courses at CMU, this tutorial will cover three topics: (1) what is multimodal: the principles in learning from heterogeneous, connected, and interacting data, (2) why is it hard: a taxonomy of six core technical challenges faced in multimodal ML but understudied in unimodal ML, and (3) what is next: major directions for future research as identified by our taxonomy.
