X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions
Annual Meeting of the Association for Computational Linguistics (2024)
Abstract
Large language models respond well in high-resource languages like English but struggle in low-resource languages. This gap may arise from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages is one possible solution but is unreliable, leading to responses with translation errors that lack language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with the instruction in English and the response in a low-resource language. Specifically, the language model first learns to generate appropriate English instructions for natural web texts in other languages, treating those texts as responses. The candidate cross-lingual instruction tuning samples are then further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset covering 10 languages, namely X-Instruction. The instruction data built with our method incorporate more language-specific knowledge than the naive translation method. Experimental results show that the response quality of the model tuned on X-Instruction greatly exceeds that of a model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow instructions written in the output language without further tuning.
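As a rough illustration of the data-construction idea sketched in the abstract, the following Python snippet pairs low-resource-language web texts (used as responses) with model-generated English instructions, then keeps only the pairs that pass a quality filter. The callables `instruct` and `quality_score` are hypothetical placeholders standing in for the paper's instruction-generation and refinement components; this is a minimal sketch of the pipeline's shape, not the actual X-Instruction implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class CrossLingualSample:
    instruction_en: str  # generated English instruction
    response: str        # natural web text in the low-resource language
    lang: str            # language code of the response

def build_samples(
    web_texts: Iterable[tuple[str, str]],        # (text, lang) pairs
    instruct: Callable[[str], str],              # hypothetical: text -> English instruction
    quality_score: Callable[[str, str], float],  # hypothetical: (instruction, response) -> score
    threshold: float = 0.5,
) -> list[CrossLingualSample]:
    """Pair each low-resource web text (treated as the response) with a
    model-generated English instruction, keeping only pairs whose
    instruction-response fit clears the quality threshold."""
    samples = []
    for text, lang in web_texts:
        # Inverse step: infer an English instruction that the text would answer.
        instruction = instruct(text)
        # Refinement step: discard candidates that score poorly.
        if quality_score(instruction, text) >= threshold:
            samples.append(CrossLingualSample(instruction, text, lang))
    return samples
```

The design choice this sketch highlights is the inverse direction of generation: rather than translating English instruction-response pairs, the response is taken verbatim from native text, so language-specific and cultural knowledge is preserved by construction and only the instruction is synthesized.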