
Multi-modal Universal Embedding Representations for Language Understanding

Frontiers in Cyber Security (2022)

Abstract
In recent years, machine learning has made substantial progress in Computer Vision (CV), Natural Language Processing (NLP), and Vision + Language (V + L). However, most existing pre-training models focus only on single-modal scenarios (i.e., using only linguistic or visual features for training) or multi-modal scenarios (i.e., using both linguistic and visual features for training), and can use only single-modal data or limited multi-modal data. As a result, models for different scenarios must be pre-trained separately, which requires substantial computing resources and time. In this paper, we propose a universal method for training a general pre-training model that handles tasks across different scenarios and modalities. Moreover, we find that a model pre-trained with multi-modal data performs better on single-modal downstream tasks. We evaluate our model on the General Language Understanding Evaluation (GLUE) benchmark for single-modal tasks, where it outperforms Bidirectional Encoder Representations from Transformers (BERT) on four tasks. For Vision + Language (V + L) tasks, we test our model on downstream tasks such as Visual Question Answering (VQA) and achieve performance comparable to the current top-performing models.
Keywords
Single-modality, Multi-modality, Pre-training, Fine-tuning, General learning method
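
The core idea in the abstract, a single pre-trained encoder that accepts text alone or text together with visual features so that one set of weights serves both single-modal (GLUE-style) and multi-modal (VQA-style) downstream tasks, can be illustrated with a minimal sketch. This is not the authors' implementation; the module names, dimensions, and the use of PyTorch are assumptions for illustration only.

```python
# A minimal sketch (not the paper's code) of a unified encoder: it takes text
# tokens alone or text tokens plus visual region features, so the same weights
# can be reused for single-modal and multi-modal tasks. All names and sizes
# here are illustrative assumptions.
import torch
import torch.nn as nn


class UnifiedEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048,
                 layers=12, heads=12, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Projects detector region features into the shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Learned type embeddings distinguish text (0) from visual (1) inputs.
        self.type_emb = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, visual_feats=None):
        # Text path: BERT-style token + position + type embeddings.
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(pos)
        x = x + self.type_emb(torch.zeros_like(token_ids))
        if visual_feats is not None:
            # Multi-modal path: project visual regions and append them so the
            # self-attention layers fuse both modalities in one sequence.
            v = self.visual_proj(visual_feats)
            v = v + self.type_emb(torch.ones(visual_feats.shape[:2],
                                             dtype=torch.long,
                                             device=visual_feats.device))
            x = torch.cat([x, v], dim=1)
        return self.encoder(x)


# A text-only batch (e.g., a GLUE sentence pair) and an image+text batch
# (e.g., VQA) both go through the same encoder with the same weights.
model = UnifiedEncoder()
text_only = model(torch.randint(0, 30522, (2, 16)))
multi_modal = model(torch.randint(0, 30522, (2, 16)),
                    torch.randn(2, 36, 2048))
print(text_only.shape, multi_modal.shape)
```

The design choice this sketch highlights is that visual features are mapped into the same embedding space as tokens and appended to the input sequence, so no separate visual-only or text-only encoder needs to be pre-trained.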