---
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: peft
tags:
- code
- security
pipeline_tag: text-generation
---
# CodeAstra-7b: State-of-the-Art Vulnerability Detection Model 🔍🛡️

## Model Description

CodeAstra-7b is a state-of-the-art language model fine-tuned for vulnerability detection in multiple programming languages. Based on the powerful Mistral-7B-Instruct-v0.2 model, CodeAstra-7b has been specifically trained to identify potential security vulnerabilities across a wide range of popular programming languages.

### Key Features

- 🌐 **Multi-language Support**: Detects vulnerabilities in Go, Python, C, C++, Fortran, Ruby, Java, Kotlin, C#, PHP, Swift, JavaScript, and TypeScript.
- πŸ† **State-of-the-Art Performance**: Achieves cutting-edge results in vulnerability detection tasks.
- πŸ“Š **Custom Dataset**: Trained on a proprietary dataset curated for comprehensive vulnerability detection.
- πŸ–₯️ **Large-scale Training**: Utilized A100 GPUs for efficient and powerful training.

## Performance Comparison 📊

CodeAstra-7b outperforms the fine-tuned and classical baselines evaluated for vulnerability detection accuracy, trailing only GPT-4o. Here's a comparison table:

| Model                                          | Accuracy (%) |
|------------------------------------------------|--------------|
| gpt4o                                          | 88.78        |
| CodeAstra-7b                                   | 83.00        |
| codebert-base-finetuned-detect-insecure-code   | 65.30        |
| CodeBERT                                       | 62.08        |
| RoBERTa                                        | 61.05        |
| TextCNN                                        | 60.69        |
| BiLSTM                                         | 59.37        |

As shown in the table, CodeAstra-7b achieves 83% accuracy, substantially surpassing the other fine-tuned and classical baselines (CodeBERT, RoBERTa, TextCNN, BiLSTM) and coming within about six points of GPT-4o.

## Intended Use

CodeAstra-7b is designed to assist developers, security researchers, and code auditors in identifying potential security vulnerabilities in source code. It can be integrated into development workflows, code review processes, or used as a standalone tool for code analysis.


### Multiple Vulnerability Scenarios

It's important to note that while CodeAstra-7b excels at finding security issues in most cases, its performance may vary when multiple vulnerabilities are present in the same code snippet. In scenarios where two or three vulnerabilities coexist, the model might not always identify all of them correctly. Users should be aware of this limitation and consider using the model as part of a broader, multi-faceted security review process.
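
For illustration, here is a small hypothetical snippet (not drawn from the training or test data) in which two distinct vulnerabilities coexist; in cases like this the model may report only one of them, so splitting the code into smaller units before analysis can help.

```python
# Hypothetical example: one snippet containing two separate vulnerabilities.
# CodeAstra-7b may flag only one of them when both appear together, so
# consider submitting each function for analysis on its own.
import os
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerability 1: SQL injection via string interpolation
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return cursor.fetchall()

def ping_host(host: str) -> None:
    # Vulnerability 2: OS command injection via unsanitized input
    os.system("ping -c 1 " + host)
```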

## Training 🏋️‍♂️

CodeAstra-7b was fine-tuned from the Mistral-7B-Instruct-v0.2 base model using a custom dataset specifically compiled for vulnerability detection across multiple programming languages. The training process leveraged A100 GPUs to ensure optimal performance and efficiency.
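
The exact fine-tuning hyperparameters are not published here; the sketch below only illustrates what a PEFT/LoRA setup on top of Mistral-7B-Instruct-v0.2 typically looks like, with assumed (not actual) values.

```python
# Illustrative sketch only -- the real LoRA hyperparameters used for
# CodeAstra-7b are not published; every value below is an assumption.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto"
)

lora_config = LoraConfig(
    r=16,                                   # assumed LoRA rank
    lora_alpha=32,                          # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trained
```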

## Usage 💻

CodeAstra-7b was trained using PEFT (Parameter-Efficient Fine-Tuning). To use the model for vulnerability detection and code quality analysis, you can leverage the Hugging Face Transformers library along with PEFT. Here's how to get started:

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
peft_model_id = "rootxhacker/CodeAstra-7B"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_4bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

def get_completion(query, model, tokenizer):
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
code_to_analyze = """
def user_input():
    name = input("Enter your name: ")
    print("Hello, " + name + "!")

user_input()
"""

query = f"Analyze this code for vulnerabilities and quality issues:\n{code_to_analyze}"
result = get_completion(query, model, tokenizer)
print(result)
```

This script loads the CodeAstra-7b model and tokenizer and provides a helper function for generating completions, which you can use to analyze code for vulnerabilities and quality issues.
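
As a further, hypothetical usage example, the same helper can be pointed at files on disk; the `src` directory below is only a placeholder.

```python
# Hypothetical batch usage: scan every Python file under ./src with the
# get_completion helper defined above (the directory name is a placeholder).
from pathlib import Path

for path in Path("src").rglob("*.py"):
    code = path.read_text()
    query = f"Analyze this code for vulnerabilities and quality issues:\n{code}"
    print(f"=== {path} ===")
    print(get_completion(query, model, tokenizer))
```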

## Limitations ⚠️

While CodeAstra-7b represents a significant advancement in automated vulnerability detection and code quality analysis, it's important to note that:

1. The model may not catch all vulnerabilities or code quality issues and should be used as part of a comprehensive security and code review strategy.
2. In cases where multiple vulnerabilities (two or three) are present in the same code snippet, the model might not identify all of them correctly.
3. False positives are possible, and results should be verified by human experts.
4. The model's performance may vary depending on the complexity and context of the code being analyzed.
5. CodeAstra's performance also depends on the length of the input code snippet.

## Test Apparatus

I tested CodeAstra-7b against code snippets drawn from sources such as the CVEfixes dataset, the YesWeHack vulnerable code repository, synthetically generated vulnerable code produced with LLMs, and the OWASP Juice Shop source code.
For comparison, I ran the same vulnerable scripts through LLMs such as GPT-4 and GPT-4o for evaluation.
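
The exact scoring procedure is not documented; as a rough, hypothetical sketch, an accuracy figure over labelled snippets could be computed along these lines (the `samples` structure and the keyword check are assumptions, not the actual protocol).

```python
# Hypothetical evaluation sketch -- not the actual scoring protocol.
# `samples` is assumed to be a list of (code, is_vulnerable) pairs drawn
# from the test sets described above.
def evaluate(samples, model, tokenizer):
    correct = 0
    for code, is_vulnerable in samples:
        query = f"Analyze this code for vulnerabilities and quality issues:\n{code}"
        answer = get_completion(query, model, tokenizer).lower()
        predicted_vulnerable = "vulnerab" in answer  # crude keyword check
        correct += int(predicted_vulnerable == is_vulnerable)
    return correct / len(samples)
```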

## Citation 📜

If you use CodeAstra-7b in your research or project, please cite it as follows:

```
@software{CodeAstra-7b,
  author = {Harish Santhanalakshmi Ganesan},
  title = {CodeAstra-7b: State-of-the-Art Vulnerability Detection Model},
  year = {2024},
  howpublished = {\url{https://huggingface.co/rootxhacker/CodeAstra-7b}}
}
```

## License 📄

CodeAstra-7b is released under the Apache License 2.0.

```
Copyright 2024 Harish Santhanalakshmi Ganesan

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

## Acknowledgements 🙏

We would like to thank the Mistral AI team for their excellent base model, which served as the foundation for CodeAstra-7b.