arXiv:2402.07688

CyberMetric: A Benchmark Dataset for Evaluating Large Language Models Knowledge in Cybersecurity

Published on Feb 12, 2024

Abstract

Large Language Models (LLMs) excel across various domains, from computer vision to medical diagnostics. However, understanding the diverse landscape of cybersecurity, which encompasses cryptography, reverse engineering, and managerial facets such as risk assessment, presents a challenge even for human experts. In this paper, we introduce CyberMetric, a benchmark dataset comprising 10,000 questions sourced from standards, certifications, research papers, books, and other publications in the cybersecurity domain. The questions were created through a collaborative process that merges expert knowledge with LLMs, including GPT-3.5 and Falcon-180B, and human experts spent over 200 hours verifying their accuracy and relevance. Beyond assessing LLMs' knowledge, the dataset's main goal is to enable a fair comparison between humans and different LLMs in cybersecurity. To achieve this, we carefully selected 80 questions covering a wide range of cybersecurity topics and involved 30 participants of diverse expertise levels, allowing a comprehensive comparison between human and machine intelligence in this area. The findings revealed that LLMs outperformed humans in almost every aspect of cybersecurity.
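
As a rough illustration of how a multiple-choice benchmark like this is consumed, the sketch below scores a model on a question file. It is a minimal sketch, not the paper's evaluation harness: the file name `CyberMetric-80.json`, the `questions`/`answers`/`solution` field names, and the JSON layout are all assumptions made for illustration.

```python
import json

def evaluate(model_answer_fn, path="CyberMetric-80.json"):
    """Score a model (a callable mapping a prompt to 'A'-'D') on the benchmark.

    Assumed file layout (not the paper's spec):
      {"questions": [{"question": "...",
                      "answers": {"A": "...", "B": "...", "C": "...", "D": "..."},
                      "solution": "A"}, ...]}
    """
    with open(path) as f:
        data = json.load(f)

    correct = 0
    for item in data["questions"]:
        # Render the question and its lettered answer options as one prompt.
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}) {text}" for letter, text in sorted(item["answers"].items())
        )
        if model_answer_fn(prompt).strip().upper() == item["solution"]:
            correct += 1
    return correct / len(data["questions"])

if __name__ == "__main__":
    # Trivial always-"A" baseline; a real run would call the model under test
    # and constrain its output to a single option letter.
    print(f"Accuracy: {evaluate(lambda prompt: 'A'):.1%}")
```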
