Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
Abstract
Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study examining the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research primarily focuses on evaluating LLMs using C/C++ datasets. It typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this knowledge gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We have compiled a comprehensive dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: Retraining models using downsampled balanced datasets. b) Model perspective: Investigating ensemble learning methods that combine predictions from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advancements in leveraging generative AI to enhance software security practices.
Community
Recent advancements in generative AI have led to the widespread adoption of
large language models (LLMs) in software engineering, addressing numerous
long-standing challenges. However, a comprehensive study examining the
capabilities of LLMs in software vulnerability detection (SVD), a crucial
aspect of software security, is currently lacking. Existing research primarily
focuses on evaluating LLMs using C/C++ datasets. It typically explores only one
or two strategies among prompt engineering, instruction tuning, and sequence
classification fine-tuning for open-source LLMs. Consequently, there is a
significant knowledge gap regarding the effectiveness of diverse LLMs in
detecting vulnerabilities across various programming languages. To address this
knowledge gap, we present a comprehensive empirical study evaluating the
performance of LLMs on the SVD task. We have compiled a comprehensive dataset
comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in
JavaScript. We assess five open-source LLMs using multiple approaches,
including prompt engineering, instruction tuning, and sequence classification
fine-tuning. These LLMs are benchmarked against five fine-tuned small language
models and two open-source static application security testing tools.
Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data
perspective: Retraining models using downsampled balanced datasets. b) Model
perspective: Investigating ensemble learning methods that combine predictions
from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains
a challenging task for LLMs. This study provides a thorough understanding of
the role of LLMs in SVD and offers practical insights for future advancements
in leveraging generative AI to enhance software security practices.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories (2025)
- Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching (2025)
- A Survey On Large Language Models For Code Generation (2025)
- Large Language Models for In-File Vulnerability Localization Can Be"Lost in the End" (2025)
- CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection (2025)
- An Empirical Study on Commit Message Generation using LLMs via In-Context Learning (2025)
- Code Summarization Beyond Function Level (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper