Model Card for Vulnera-Scan

Model Details

Model Description

The Vulnera_Scan model is a transformer-based model fine-tuned for detecting vulnerabilities in programming code. It leverages pre-trained weights from meta-llama/Llama-3.1-8B-Instruct and has been fine-tuned on the google/code_x_glue_cc_defect_detection dataset, a collection of labeled code snippets annotated for security defects such as buffer overflows and other memory-safety issues.

The model performs feature extraction on code snippets and classifies them based on whether they exhibit vulnerable behavior. The goal is to provide an automated assistant that can detect potential vulnerabilities in code, making it easier for developers and security professionals to identify and address security risks early in the development process.

  • Model Type: GRPO-based model for defect detection
  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Pipeline Tag: feature-extraction
  • Fine-tuned from: meta-llama/Llama-3.1-8B-Instruct
  • License: Boost Software License 1.0
  • Primary Task: Vulnerability detection in code
  • Training Dataset: google/code_x_glue_cc_defect_detection
  • Language(s): English (natural-language prompts); programming code (training data)
  • Metrics Used: Accuracy

Model Architecture

The model uses a variant of the transformer architecture fine-tuned for vulnerability detection in code. It processes code snippets and applies patterns learned from the dataset to classify code as either vulnerable or not. Fine-tuning was performed with GRPO (Group Relative Policy Optimization), a reinforcement-learning method that helps the model adapt to the defect-detection task and can improve on standard supervised fine-tuning.

The model employs a tokenization approach that is well-suited for code, converting the raw code into tokenized representations that are then processed by the transformer architecture. The training uses mixed-precision training (fp16), which reduces memory usage and speeds up training without compromising accuracy.
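As a minimal illustration of this flow, the sketch below tokenizes a code snippet and asks the model for a verdict. It assumes the checkpoint keeps the Llama-3.1 chat template and answers with a short vulnerable / not vulnerable judgement; the prompt wording is illustrative, since the exact format used during fine-tuning is not documented in this card.

```python
# Minimal inference sketch (assumptions: the checkpoint keeps the Llama-3.1 chat
# template and replies with a short "vulnerable" / "not vulnerable" verdict; the
# exact prompt format used during fine-tuning is not documented in this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilyass31/Vulnera_Scan"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

snippet = """
void copy_name(char *src) {
    char buf[8];
    strcpy(buf, src);  /* unchecked copy into a fixed-size buffer */
}
"""

messages = [
    {"role": "system",
     "content": "You are a code security auditor. Answer 'vulnerable' or 'not vulnerable' and explain briefly."},
    {"role": "user", "content": f"Is this function vulnerable?\n\n{snippet}"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```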

Performance

The model has been evaluated on the test split of the google/code_x_glue_cc_defect_detection dataset, reaching an accuracy of 85% at distinguishing vulnerable from non-vulnerable code snippets.
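A sketch of how such an evaluation could be reproduced is shown below. It assumes a hypothetical helper, predict_vulnerable(), that wraps the generation call from the previous section and returns a boolean; the func and target column names follow the Hugging Face version of the dataset.

```python
# Hedged reproduction sketch for the reported accuracy. predict_vulnerable() is a
# hypothetical helper that wraps the model call and returns True for "vulnerable";
# "func" holds the source code and "target" the 0/1 defect label in this dataset.
from datasets import load_dataset

test_split = load_dataset("google/code_x_glue_cc_defect_detection", split="test")

correct = 0
for example in test_split:
    prediction = predict_vulnerable(example["func"])
    correct += int(prediction == bool(example["target"]))

print(f"accuracy: {correct / len(test_split):.3f}")
```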

Practical Use Cases

  • Code Security Analysis: Automatically detecting potential security vulnerabilities within code to ensure that software is secure before deployment.
  • Developer Assistance: Helping developers identify areas in their codebase that could be potential vulnerabilities, reducing manual inspection time and improving overall code security.
  • Automated Code Reviews: Integrating the model into code review processes to flag potential vulnerabilities as part of the continuous integration (CI) pipeline.


Model Description

The Vulnera_Scan model is a fine-tuned transformer-based model developed for detecting vulnerabilities in programming code. Fine-tuned from meta-llama/Llama-3.1-8B-Instruct, it was trained on the google/code_x_glue_cc_defect_detection dataset, which contains code samples annotated for defects such as buffer overflows and other memory-safety issues.

This model is designed to classify code snippets by whether they are vulnerable, helping developers and security experts automatically detect and address potential security risks in their software. It performs feature extraction and classification, and was optimized with GRPO (Group Relative Policy Optimization) for better performance on defect-detection tasks.

The model tokenizes raw code and uses its transformer backbone to predict whether a snippet contains a vulnerability. Mixed-precision (fp16) training keeps memory usage and training time manageable for a model of this size, making it practical for large-scale code analysis.

  • Developed by: Ilyas Dahaoui
  • Model type: Transformer-based model for code vulnerability detection
  • Language(s) (NLP): Programming code (e.g., Python, C++, JavaScript)
  • License: MIT License
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct (fine-tuned with GRPO)

Direct Use

This model is designed to identify potential vulnerabilities in code by analyzing programming syntax, logic, and structure. It is mainly intended for developers, security analysts, and researchers who want to scan software projects for defects or weaknesses. It can help detect issues such as buffer overflows, SQL injection points, memory leaks, and other common vulnerabilities in codebases.

For direct use, the model can:

  • Analyze code snippets or entire codebases to identify potential security flaws (see the scanning sketch below).
  • Integrate into CI/CD pipelines for vulnerability detection during development.
  • Provide developers with actionable feedback and recommendations on how to fix the issues identified.
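As a sketch of the codebase-scanning and CI use cases above, the snippet below walks a repository and flags files the model considers vulnerable. predict_vulnerable() is the same hypothetical helper described earlier, and the *.c file filter is purely illustrative.

```python
# Repository scan sketch for CI-style use. predict_vulnerable() is the hypothetical
# wrapper around the model described above; a non-zero exit code lets a CI job fail
# the build when any file is flagged.
import sys
from pathlib import Path

def scan_repository(root: str) -> list[Path]:
    flagged = []
    for path in Path(root).rglob("*.c"):           # illustrative file filter
        source = path.read_text(errors="ignore")
        if predict_vulnerable(source):
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    hits = scan_repository(".")
    for path in hits:
        print(f"potential vulnerability: {path}")
    sys.exit(1 if hits else 0)
```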

Downstream Use

This model can be fine-tuned further for specific use cases or integrated into larger security frameworks, such as:

  • Code review tools focused on security vulnerability detection.
  • Automated testing frameworks for software applications.
  • Security audit tools used by organizations to assess code security.

This could benefit enterprises looking to improve their software's security posture and catch vulnerabilities before they reach production environments.

Out-of-Scope Use

The model is not designed for:

  • General-purpose code generation or tasks such as coding-style or optimization recommendations.
  • Detecting non-security-related bugs, such as performance bottlenecks.
  • Use cases that require extremely high precision or cover domain-specific vulnerability types; further fine-tuning may be necessary there.

The model should not be used to mislead or manipulate software security in malicious ways. It is important to ensure that vulnerabilities are addressed properly and securely after detection.

Training Details

Training Data

The model was fine-tuned on the Defect Detection task of the CodeXGLUE benchmark (google/code_x_glue_cc_defect_detection). The dataset consists of functions drawn from open-source C projects, each annotated with a binary label indicating whether the function contains a defect, and is designed to train models for defect classification and bug prediction in code.
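A quick way to inspect this data is shown below; the func (function source) and target (binary defect label) field names follow the Hugging Face version of the dataset.

```python
# Loading and inspecting the fine-tuning data. Field names follow the Hugging Face
# google/code_x_glue_cc_defect_detection dataset: "func" is the function source,
# "target" the binary defect label.
from datasets import load_dataset

ds = load_dataset("google/code_x_glue_cc_defect_detection")
print(ds)                      # train / validation / test splits

example = ds["train"][0]
print(example["target"])       # 1 = defective, 0 = not defective
print(example["func"][:300])   # first characters of the function source
```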

Training Hyperparameters

  • Training regime: fp16 mixed precision

    • The model was fine-tuned using 16-bit mixed precision (fp16) training. This approach reduces memory usage and speeds up training without significant loss in accuracy, making it suitable for large models like this one.
  • Learning rate: 5e-6

    • A low learning rate of 5e-6 was used to ensure smooth convergence while avoiding overfitting.
  • Batch size: 1

    • A batch size of 1 was used due to GPU memory limitations. To simulate larger batch sizes, gradient accumulation was applied.
  • Optimizer: paged_adamw_8bit

    • The AdamW optimizer with 8-bit precision was used to further optimize memory efficiency while maintaining stable training.
  • Gradient accumulation: 4

    • Gradient accumulation was performed over 4 steps to simulate larger batch sizes and avoid running out of GPU memory.
  • Warmup ratio: 0.1

    • A warmup ratio of 0.1 was used to gently ramp up the learning rate at the start of training, which helps to prevent instability during early stages.
  • Learning rate scheduler: cosine

    • A cosine learning rate scheduler was used to smoothly decrease the learning rate over the course of training, helping the model converge effectively.
  • Max gradient norm: 1.0

    • Gradient clipping was applied with a max gradient norm of 1.0 to avoid issues with exploding gradients during training.
  • Number of epochs: 1

    • The model was trained for 1 epoch to quickly validate its performance and adjust hyperparameters based on available computational resources.
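Under the assumption that training used TRL's GRPO implementation (the card names GRPO but not the framework), these hyperparameters translate roughly into the configuration sketch below; the reward function and dataset preprocessing are omitted because they are not documented here.

```python
# Hedged reconstruction of the training configuration from the values above.
# Assumption: TRL's GRPOConfig/GRPOTrainer were used; the card only lists GRPO and
# these hyperparameters, not the exact training script or reward function.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="vulnera-scan-grpo",
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    optim="paged_adamw_8bit",        # 8-bit paged AdamW
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    num_train_epochs=1,
    fp16=True,                       # 16-bit mixed-precision training
)
```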

Model Card Contact

For inquiries or more information about this model, please contact the model author via the ilyass31/Vulnera_Scan repository on Hugging Face.

If you have any technical questions, issues, or would like to collaborate, feel free to reach out through that channel.
