EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models
CoRR (2024)
Abstract
In the field of environmental science, it is crucial to have robust
evaluation metrics for large language models to ensure their efficacy and
accuracy. We propose EnviroExam, a comprehensive evaluation method designed to
assess the knowledge of large language models in the field of environmental
science. EnviroExam is based on the curricula of top international
universities, covering undergraduate, master's, and doctoral courses, and
includes 936 questions across 42 core courses. By conducting 0-shot and 5-shot
tests on 31 open-source large language models, EnviroExam reveals the
performance differences among these models in the domain of environmental
science and provides detailed evaluation standards. The results show that 61.3%
of the models passed the 5-shot tests, while 48.39% passed the 0-shot tests. By
introducing the coefficient of variation as an indicator, we evaluate the
performance of mainstream open-source large language models in environmental
science from multiple perspectives, providing effective criteria for selecting
and fine-tuning language models in this field. Future research will involve
constructing more domain-specific test sets using specialized environmental
science textbooks to further enhance the accuracy and specificity of the
evaluation.
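The abstract names the coefficient of variation as the indicator used to compare models across courses, but does not spell out the computation. A minimal sketch of the standard definition (standard deviation divided by mean), applied to hypothetical per-course accuracy scores, might look like:

```python
import statistics

def coefficient_of_variation(scores):
    """Coefficient of variation = sample standard deviation / mean.

    A lower value suggests more consistent performance across
    the different courses a model is tested on.
    """
    return statistics.stdev(scores) / statistics.mean(scores)

# Hypothetical per-course accuracy scores for one model
# (illustrative values, not taken from the paper).
scores = [0.72, 0.65, 0.80, 0.70]
cv = coefficient_of_variation(scores)
```

Comparing models by both mean accuracy and this dispersion measure, as the paper proposes, distinguishes a model that is uniformly competent across environmental-science subfields from one whose average is inflated by a few strong courses.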