CyberMetric: A Benchmark Dataset for Evaluating Large Language Models Knowledge in Cybersecurity
CoRR (2024)
Abstract
Large Language Models (LLMs) excel across various domains, from computer
vision to medical diagnostics. However, understanding the diverse landscape of
cybersecurity, encompassing cryptography, reverse engineering, and managerial
facets like risk assessment, presents a challenge, even for human experts. In
this paper, we introduce CyberMetric, a benchmark dataset comprising 10,000
questions sourced from standards, certifications, research papers, books, and
other publications in the cybersecurity domain. The questions were created
through a collaborative process that combines expert knowledge with LLMs,
including GPT-3.5 and Falcon-180B. Human experts spent over 200 hours verifying
their accuracy and relevance. Beyond assessing LLMs' knowledge, the dataset's
main goal is to facilitate a fair comparison between humans and different LLMs
in cybersecurity. To achieve this, we carefully selected 80 questions covering
a wide range of topics within cybersecurity and involved 30 participants of
diverse expertise levels, enabling a comprehensive comparison between human
and machine intelligence in this area. The findings revealed that LLMs
outperformed humans in almost every aspect of cybersecurity.
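The evaluation the abstract describes, measuring an LLM's accuracy on multiple-choice cybersecurity questions, can be sketched as below. This is a minimal illustration only: the JSON-style field names (`question`, `options`, `solution`) are assumptions for the sketch, not CyberMetric's documented schema, and the stand-in "model" is a trivial placeholder.

```python
# Hypothetical sketch: scoring answers on a multiple-choice benchmark.
# Field names are assumed for illustration, not CyberMetric's actual schema.

def accuracy(questions, answer_fn):
    """Fraction of questions for which answer_fn picks the correct option key."""
    correct = 0
    for q in questions:
        predicted = answer_fn(q["question"], q["options"])  # e.g. "B"
        if predicted == q["solution"]:
            correct += 1
    return correct / len(questions)

# Toy data and a stand-in "model" that always answers "A".
sample = [
    {"question": "Which algorithm is a symmetric cipher?",
     "options": {"A": "AES", "B": "RSA", "C": "ECDSA", "D": "DSA"},
     "solution": "A"},
    {"question": "Which port does HTTPS use by default?",
     "options": {"A": "80", "B": "22", "C": "443", "D": "25"},
     "solution": "C"},
]
always_a = lambda question, options: "A"
print(accuracy(sample, always_a))  # prints 0.5
```

In a real run, `answer_fn` would prompt an LLM with the question and its options and parse the chosen option letter from the response; the human-vs-LLM comparison in the paper reduces to computing this same accuracy for each participant or model.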