arxiv:2409.01944

FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Published on Sep 3
· Submitted by zhangysk on Sep 6
#3 Paper of the day
Abstract

Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs efficiently is a difficult open problem, and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework that leverages code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as a sequence-to-sequence modeling task, where the LLM receives a sequence of bytes and outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gains significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.

Community

Paper author Paper submitter

Fuzzing is a critical dynamic program analysis technique for identifying vulnerabilities in complex software. It involves presenting a target program with carefully crafted inputs to induce crashes, buffer overflows, memory errors, and exceptions. However, efficiently generating effective malicious inputs remains a challenging open problem, with current best practices often relying on uniform random mutations of existing valid inputs.

In this work, we propose FuzzCoder, a novel approach that leverages fine-tuned large language models (LLMs) to enhance fuzzing efficiency. Our method learns patterns from successful attack inputs to guide future fuzzing explorations. We develop a framework that utilizes code LLMs to steer the input mutation process in fuzzing, formulating it as a sequence-to-sequence modeling task where the LLM receives a byte sequence and outputs a mutated version.
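The sequence-to-sequence formulation above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `(position, strategy, value)` prediction interface, the strategy names, and the `toy_model` stand-in for the fine-tuned LLM are all assumptions made for illustration.

```python
# Sketch of byte-level mutation guided by a model prediction.
# The model interface and mutation strategies are illustrative assumptions.

def mutate(data: bytes, predict) -> bytes:
    """Apply one model-predicted (position, strategy, value) mutation."""
    pos, strategy, value = predict(data)
    buf = bytearray(data)
    if strategy == "flip":        # XOR the chosen byte with a mask
        buf[pos] ^= value
    elif strategy == "replace":   # overwrite the chosen byte
        buf[pos] = value
    elif strategy == "insert":    # insert a new byte at the position
        buf.insert(pos, value)
    return bytes(buf)

# Stand-in for the fine-tuned LLM: always flips the high bit of byte 0.
toy_model = lambda data: (0, "flip", 0x80)

seed = b"\x7fELF"                       # ELF magic bytes as a seed input
print(mutate(seed, toy_model).hex())    # 0x7f ^ 0x80 = 0xff -> 'ff454c46'
```

In the actual framework, the prediction would come from the fine-tuned code LLM operating on the full byte sequence rather than a fixed rule.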

FuzzCoder is fine-tuned on Fuzz-Instruct, a custom instruction dataset compiled from successful fuzzing histories gathered using heuristic fuzzing tools. This enables FuzzCoder to predict optimal mutation locations and strategies within input files, increasing the likelihood of triggering abnormal program behaviors.

We integrate FuzzCoder with AFL (American Fuzzy Lop) and evaluate its performance across various input formats, including ELF, JPG, MP3, and XML. Experimental results demonstrate significant improvements in two key metrics: the effective proportion of mutation (EPM) and the number of crashes (NC) detected.
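The two metrics can be made concrete with a small sketch. The exact definitions here are assumptions: EPM is taken as the fraction of mutated inputs judged effective (e.g. triggering new coverage or a crash), and NC as a simple count of observed crashes.

```python
# Hedged sketch of the evaluation metrics (definitions assumed, not
# taken verbatim from the paper).

def epm(effective_mutations: int, total_mutations: int) -> float:
    """Effective proportion of mutation: effective / total."""
    if total_mutations == 0:
        return 0.0
    return effective_mutations / total_mutations

def nc(crashes: list[str]) -> int:
    """Number of crashes observed during a fuzzing campaign."""
    return len(crashes)

print(epm(120, 1000))  # 0.12
print(nc(["SIGSEGV@0x41", "heap-overflow@parse_header"]))  # 2
```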

Sources:
Paper: https://arxiv.org/pdf/2409.01944
Code: https://github.com/weimo3221/FUZZ-CODER

