arxiv:2503.01449

Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

Published on Mar 3

Abstract

Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study examining the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research primarily focuses on evaluating LLMs using C/C++ datasets. It typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this knowledge gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We have compiled a comprehensive dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: Retraining models using downsampled balanced datasets. b) Model perspective: Investigating ensemble learning methods that combine predictions from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advancements in leveraging generative AI to enhance software security practices.
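The two improvement avenues named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the downsampling or the ensembling is done, so everything below is an assumption for illustration: the helper names `downsample_balanced` and `majority_vote`, the binary label convention (1 = vulnerable, 0 = non-vulnerable), random undersampling to the minority-class size, and simple majority voting with ties resolved toward "vulnerable". The paper may use different procedures.

```python
import random
from collections import Counter


def downsample_balanced(samples, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    `samples` is a list of (function_source, label) pairs, where label is
    1 for vulnerable and 0 for non-vulnerable (an assumed convention).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = {}
    for code, label in samples:
        by_label.setdefault(label, []).append((code, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions):
    """Combine per-model 0/1 predictions by majority vote.

    `predictions` is a list with one inner list per model; all inner lists
    cover the same functions in the same order. Ties count as vulnerable
    (1), which is an illustrative choice, not taken from the paper.
    """
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        voted.append(1 if counts[1] >= counts[0] else 0)
    return voted


if __name__ == "__main__":
    data = [("f1", 1), ("f2", 0), ("f3", 0), ("f4", 0)]
    print(Counter(label for _, label in downsample_balanced(data)))
    print(majority_vote([[1, 0, 1], [0, 0, 1], [1, 0, 0]]))
```

Resolving ties toward "vulnerable" favors recall over precision, a common preference in security screening; with an odd number of models ties cannot occur.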

Community

vansin

Paper submitter 3 days ago
