X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
CoRR (2024)
Abstract
The impressive development of large language models (LLMs) is expanding into
the realm of large multimodal models (LMMs), which incorporate multiple types
of data beyond text. However, the nature of multimodal models leads to
significant expenses in the creation of training data. Furthermore,
constructing multilingual data for LMMs presents its own set of challenges due
to language diversity and complexity. Therefore, in this study, we propose two
cost-effective methods to solve this problem: (1) vocabulary expansion and
pretraining of a multilingual LLM for specific languages, and (2) automatic and
elaborate construction of multimodal datasets using GPT4-V. Based on these
methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal
training dataset. Additionally, we developed a bilingual multimodal model that
exhibits excellent performance in both Korean and English, surpassing existing
approaches.