Update model card with full test set evaluation metrics (5K samples)
README.md
CHANGED
@@ -29,16 +29,20 @@ model-index:
     dataset:
       type: ayshajavd/code-security-vulnerability-dataset
       name: Code Security Vulnerability Dataset
     metrics:
     - type: f1
-      value: 0.
       name: Weighted F1
     - type: f1
-      value: 0.
       name: Micro F1
     - type: f1
-      value: 0.
-      name:
 ---

 # GraphCodeBERT Vulnerability Classifier
@@ -76,7 +80,7 @@ TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94
                "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
                "CWE-787", "CWE-798", "CWE-918"]

-threshold = 0.
 for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):
     if prob > threshold:
         print(f"{cwe}: {prob:.3f}")
@@ -86,12 +90,12 @@ for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):

 | Property | Value |
 |----------|-------|
-| **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden,
 | **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
 | **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
 | **Labels** | 31 (30 CWE categories + "safe") |
 | **Max Sequence Length** | 512 tokens |
-| **

 ## Supported Languages

@@ -99,82 +103,89 @@ Python, JavaScript, Java, C, C++, PHP, Go

 The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).

-[72 removed lines (old lines 102-173) are truncated in this view; the surviving fragments are per-CWE table rows cut off after `| CWE-`, the heading fragment `### OWASP A10`, and the table header fragment `| CWE | Name |`]

 ## Training Data

-The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples),

 1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)**: 265K C/C++ vulnerable functions from real CVEs
 2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)**: Balanced CWE-labeled subset
@@ -185,49 +196,25 @@ The model was trained on the [code-security-vulnerability-dataset](https://huggi

 | Parameter | Value |
 |-----------|-------|
-| Epochs | 2
 | Batch Size | 8 |
 | Learning Rate | 5e-5 |
 | Scheduler | Cosine with warmup (50 steps) |
-| Loss | BCEWithLogitsLoss (class-weighted, clipped
-| Training Subset | 20K balanced samples
-| Validation Subset | 3K samples |
 | Optimizer | AdamW (fused) |

 ## Limitations

-1. **Class imbalance**: Many rare CWE types have very few training examples
-2. **Sequence length**: Limited to 512 tokens
-3. **Language bias**: Strongest on C/C++ due to BigVul's dominance
-4. **
-5. **
-6. **Not a replacement for manual review**: This model should complement, not replace, human security review and established SAST tools.
-
-## Example Predictions
-
-### SQL Injection (Python)
-```python
-query = f"SELECT * FROM users WHERE username = '{username}'"
-cursor.execute(query)
-# → CWE-89: SQL Injection (confidence: 0.85)
-```
-
-### Buffer Overflow (C)
-```c
-char buffer[64];
-strcpy(buffer, user_input);
-// → CWE-119: Buffer Overflow (confidence: 0.72)
-```
-
-### Safe Code
-```python
-cursor.execute("SELECT * FROM users WHERE username = ?", (username,))
-# → safe (confidence: 0.94)
-```

 ## Interactive Demo

-Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer): paste any code and get a full security report with OWASP mapping, severity scores, and suggested fixes.

 ## Citation

@@ -238,4 +225,4 @@ Try the model in our [Code Security Analyzer Space](https://huggingface.co/space
   year={2025},
   url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
 }
-```
@@ -29,16 +29,20 @@ model-index:
     dataset:
       type: ayshajavd/code-security-vulnerability-dataset
       name: Code Security Vulnerability Dataset
+      split: test
     metrics:
     - type: f1
+      value: 0.8648
       name: Weighted F1
     - type: f1
+      value: 0.4575
       name: Micro F1
     - type: f1
+      value: 0.9501
+      name: F1 (safe class)
+    - type: recall
+      value: 0.5018
+      name: Macro Recall
 ---

 # GraphCodeBERT Vulnerability Classifier
@@ -76,7 +80,7 @@ TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94
                "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
                "CWE-787", "CWE-798", "CWE-918"]

+threshold = 0.5
 for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):
     if prob > threshold:
         print(f"{cwe}: {prob:.3f}")
@@ -86,12 +90,12 @@ for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):

 | Property | Value |
 |----------|-------|
+| **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden, 83.5M params) |
 | **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
 | **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
 | **Labels** | 31 (30 CWE categories + "safe") |
 | **Max Sequence Length** | 512 tokens |
+| **Recommended Threshold** | 0.5 (balanced precision/recall) or 0.3 (high recall, security-first) |

 ## Supported Languages

@@ -99,82 +103,89 @@ Python, JavaScript, Java, C, C++, PHP, Go

 The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).

+## Evaluation Results (Test Set, 5,000 samples)
+
+### Threshold Comparison
+
+| Threshold | Macro F1 | Micro F1 | Weighted F1 | Macro Precision | Macro Recall |
+|-----------|----------|----------|-------------|-----------------|--------------|
+| 0.2 | 0.066 | 0.301 | 0.859 | 0.048 | 0.562 |
+| **0.3** | **0.081** | **0.458** | **0.865** | **0.057** | **0.502** |
+| 0.4 | 0.101 | 0.626 | 0.870 | 0.070 | 0.439 |
+| **0.5** | **0.125** | **0.739** | **0.870** | **0.088** | **0.366** |
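
A threshold comparison like the one above can be reproduced from per-sample probabilities and binary label vectors. The following is a minimal pure-Python sketch, not the evaluation script used for this model card: the function names and the toy data are made up for illustration, and a class with no positives and no predictions contributes 0.0 to the macro average, which is only one of several conventions.

```python
from typing import Dict, Sequence


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when there are no TP, FP, or FN at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> Dict[float, Dict[str, float]]:
    """Compute macro and micro F1 for each threshold on multi-label data."""
    n_classes = len(labels[0])
    results: Dict[float, Dict[str, float]] = {}
    for t in thresholds:
        tp = [0] * n_classes
        fp = [0] * n_classes
        fn = [0] * n_classes
        for row_probs, row_labels in zip(probs, labels):
            for c in range(n_classes):
                predicted = row_probs[c] > t  # strict comparison, as in the usage snippet
                if predicted and row_labels[c]:
                    tp[c] += 1
                elif predicted:
                    fp[c] += 1
                elif row_labels[c]:
                    fn[c] += 1
        # Macro: average the per-class F1 scores. Micro: pool the counts first.
        macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes
        micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
        results[t] = {"macro_f1": macro, "micro_f1": micro}
    return results


# Toy data (invented for illustration): 4 samples, 3 labels.
toy_probs = [[0.9, 0.35, 0.1], [0.2, 0.6, 0.4], [0.8, 0.1, 0.45], [0.1, 0.2, 0.7]]
toy_labels = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
sweep = sweep_thresholds(toy_probs, toy_labels, [0.3, 0.5])
for threshold, scores in sweep.items():
    print(threshold, {name: round(value, 3) for name, value in scores.items()})
```

On the toy data, lowering the threshold from 0.5 to 0.3 adds false positives without recovering any missed labels, so both averages drop. That mirrors the direction of the precision change in the table, though the real trade-off also involves recall.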
+
+### Per-Class Performance (threshold=0.3)
+
+#### OWASP A01:2021 - Broken Access Control
+| CWE | Name | Support | Precision | Recall | F1 |
+|-----|------|---------|-----------|--------|-----|
+| CWE-22 | Path Traversal | 2 | 0.000 | 0.000 | 0.000 |
+| CWE-200 | Information Exposure | 30 | 0.063 | 0.800 | 0.117 |
+| CWE-264 | Permissions/Privileges | 23 | 0.025 | 0.696 | 0.049 |
+| CWE-269 | Improper Privilege Mgmt | 1 | 0.000 | 0.000 | 0.000 |
+| CWE-276 | Incorrect Permissions | 0 | - | - | - |
+| CWE-284 | Access Control | 5 | 0.000 | 0.000 | 0.000 |
+| CWE-352 | CSRF | 1 | 0.000 | 0.000 | 0.000 |
+| CWE-601 | Open Redirect | 0 | - | - | - |
+
+#### OWASP A02:2021 - Cryptographic Failures
+| CWE | Name | Support | Precision | Recall | F1 |
+|-----|------|---------|-----------|--------|-----|
+| CWE-310 | Cryptographic Issues | 5 | 0.000 | 0.000 | 0.000 |
+| CWE-327 | Broken Crypto Algorithm | 1 | 0.000 | 0.000 | 0.000 |
+| CWE-330 | Insufficient Randomness | 1 | 0.000 | 0.000 | 0.000 |
+
+#### OWASP A03:2021 - Injection
+| CWE | Name | Support | Precision | Recall | F1 |
+|-----|------|---------|-----------|--------|-----|
+| CWE-20 | Input Validation | 69 | 0.023 | **0.957** | 0.046 |
+| CWE-78 | Command Injection | 1 | 0.011 | **1.000** | 0.021 |
+| CWE-79 | XSS | 16 | 0.084 | **0.750** | 0.151 |
+| CWE-89 | SQL Injection | 15 | 0.096 | **1.000** | 0.174 |
+| CWE-94 | Code Injection | 27 | 0.123 | **1.000** | 0.220 |
+| CWE-119 | Buffer Overflow | 118 | 0.088 | **0.898** | 0.160 |
+| CWE-125 | Out-of-bounds Read | 35 | 0.048 | **0.829** | 0.091 |
+| CWE-190 | Integer Overflow | 14 | 0.033 | **1.000** | 0.064 |
+| CWE-401 | Memory Leak | 2 | 0.022 | **1.000** | 0.044 |
+| CWE-416 | Use After Free | 20 | 0.048 | 0.400 | 0.086 |
+| CWE-476 | NULL Pointer Deref | 30 | 0.032 | **0.867** | 0.061 |
+| CWE-787 | Out-of-bounds Write | 46 | 0.052 | **0.891** | 0.099 |
+
+#### OWASP A04:2021 - Insecure Design
+| CWE | Name | Support | Precision | Recall | F1 |
+|-----|------|---------|-----------|--------|-----|
+| CWE-362 | Race Condition | 11 | 0.035 | 0.636 | 0.065 |
+| CWE-399 | Resource Management | 21 | 0.008 | **0.857** | 0.015 |
+| CWE-434 | File Upload | 0 | - | - | - |
+
+#### OWASP A07-A10
+| CWE | Name | Support | Precision | Recall | F1 |
+|-----|------|---------|-----------|--------|-----|
+| CWE-287 | Authentication | 0 | - | - | - |
+| CWE-798 | Hardcoded Credentials | 0 | - | - | - |
+| CWE-502 | Deserialization | 10 | 0.056 | **1.000** | 0.106 |
+| CWE-918 | SSRF | 0 | - | - | - |
+
+### Key Metric: Safe Code Detection
+| Class | Support | Precision | Recall | F1 |
+|-------|---------|-----------|--------|-----|
+| **safe** | **4,496** | **0.927** | **0.975** | **0.950** |
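
The F1 column in these tables is the harmonic mean of the precision and recall columns. As a quick check, the sketch below recomputes it for two rows; `f1_score` is a helper defined here, not a library call. Because the published precision and recall are already rounded to three decimals, other rows may differ from a recomputed value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the "safe" and CWE-200 rows of the tables above.
safe_f1 = f1_score(0.927, 0.975)
cwe200_f1 = f1_score(0.063, 0.800)
print(f"safe:    {safe_f1:.3f}")    # prints "safe:    0.950"
print(f"CWE-200: {cwe200_f1:.3f}")  # prints "CWE-200: 0.117"
```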
+
+### Model Strengths
+- **Excellent recall** on many vulnerability classes (0.75-1.0 for SQL injection, buffer overflow, XSS, code injection, etc.)
+- **Strong safe code detection** (F1=0.95): reliably identifies secure code
+- **High sensitivity**: at threshold 0.3, recall is high on the best-supported classes (0.957 on CWE-20, 0.898 on CWE-119), while macro recall is 0.50 because several rare classes score 0
+
+### Model Limitations
+- **Low precision on rare classes**: many false positives, especially on CWEs with few training examples
+- Precision can be improved by using **threshold=0.5**: macro precision rises from 0.057 to 0.088 and macro F1 from 0.081 to 0.125, while macro recall drops from 0.502 to 0.366
+- Classes with 0 test support cannot be evaluated
+
+> **Design choice:** For security applications, we prioritize recall (catching real vulnerabilities) over precision (reducing false positives). Missing a real vulnerability (false negative) is worse than flagging safe code (false positive).
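
One way to apply this recall-first policy is to fix a minimum acceptable macro recall and then take the threshold with the best macro F1 among those that meet it. The sketch below uses the macro figures reported in the threshold comparison table; `pick_threshold` and the recall floors are hypothetical, chosen for illustration rather than taken from the model card.

```python
from typing import Dict, Optional

# Macro metrics copied from the threshold comparison table.
REPORTED: Dict[float, Dict[str, float]] = {
    0.2: {"macro_f1": 0.066, "macro_recall": 0.562},
    0.3: {"macro_f1": 0.081, "macro_recall": 0.502},
    0.4: {"macro_f1": 0.101, "macro_recall": 0.439},
    0.5: {"macro_f1": 0.125, "macro_recall": 0.366},
}


def pick_threshold(
    reported: Dict[float, Dict[str, float]], min_macro_recall: float
) -> Optional[float]:
    """Return the threshold with the best macro F1 among those meeting the recall floor."""
    eligible = {
        t: metrics
        for t, metrics in reported.items()
        if metrics["macro_recall"] >= min_macro_recall
    }
    if not eligible:
        return None  # no reported threshold satisfies the floor
    return max(eligible, key=lambda t: eligible[t]["macro_f1"])


print(pick_threshold(REPORTED, 0.50))  # security-first floor
print(pick_threshold(REPORTED, 0.35))  # looser floor
```

With a floor of 0.50 this selects 0.3, and with a floor of 0.35 it selects 0.5, which matches the two recommended operating points.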

 ## Training Data

+The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples), combining:

 1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)**: 265K C/C++ vulnerable functions from real CVEs
 2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)**: Balanced CWE-labeled subset
@@ -185,49 +196,25 @@ The model was trained on the [code-security-vulnerability-dataset](https://huggi

 | Parameter | Value |
 |-----------|-------|
+| Epochs | 2 |
 | Batch Size | 8 |
 | Learning Rate | 5e-5 |
 | Scheduler | Cosine with warmup (50 steps) |
+| Loss | BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x) |
+| Training Subset | 20K balanced samples |
 | Optimizer | AdamW (fused) |
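
The table states only that the loss is class-weighted with `pos_weight` clipped to 30x; it does not say how the weights were derived. The sketch below assumes the common choice of a negative-to-positive count ratio per label, then clips it at 30 and applies it the way PyTorch's `BCEWithLogitsLoss` applies `pos_weight` (scaling only the positive-target term). It is written in plain Python so it runs without PyTorch; the label counts are invented.

```python
import math
from typing import List, Sequence


def clipped_pos_weights(
    positive_counts: Sequence[int], n_samples: int, max_weight: float = 30.0
) -> List[float]:
    """Per-label weight = negatives / positives, capped at max_weight (assumed formula)."""
    weights = []
    for pos in positive_counts:
        neg = n_samples - pos
        ratio = neg / pos if pos > 0 else max_weight
        weights.append(min(ratio, max_weight))
    return weights


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce_with_logits(logit: float, target: int, pos_weight: float) -> float:
    """Binary cross-entropy on a raw logit, with the positive term scaled by pos_weight."""
    # -log(sigmoid(x)) == softplus(-x) and -log(1 - sigmoid(x)) == softplus(x)
    return pos_weight * target * softplus(-logit) + (1 - target) * softplus(logit)


# Invented counts for three labels in a 20,000-sample training subset.
weights = clipped_pos_weights([10_000, 1_000, 20], n_samples=20_000)
print(weights)  # the rarest label hits the 30x cap
print(weighted_bce_with_logits(0.0, 1, weights[2]))  # missed rare positive
print(weighted_bce_with_logits(0.0, 0, weights[2]))  # negative term is unweighted
```

Without the cap, the rarest label in this toy example would get a weight of 999, so a single missed positive would dominate the batch loss. Clipping limits that effect, and the remaining up-weighting of positives is one plausible contributor to the high-recall, low-precision behaviour on rare classes.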

 ## Limitations

+1. **Class imbalance**: Many rare CWE types have very few training examples, leading to high false positive rates
+2. **Sequence length**: Limited to 512 tokens, so long functions may be truncated
+3. **Language bias**: Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
+4. **Single-function analysis**: Analyzes individual functions, not cross-function or cross-file vulnerabilities
+5. **Not a replacement**: Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
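
For the sequence-length limitation, one possible workaround is to score a long function in overlapping 512-token windows and keep the highest probability seen for each label. This is a sketch of that idea only; nothing in the model card says the model or the demo Space does this. The helper names, the stride of 256, and the toy probabilities are assumptions, and the windowing ignores the special tokens a real tokenizer would add.

```python
from typing import List, Sequence


def sliding_windows(
    token_ids: Sequence[int], max_len: int = 512, stride: int = 256
) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def max_pool_probs(window_probs: Sequence[Sequence[float]]) -> List[float]:
    """Keep, for each label, the highest probability seen in any window."""
    return [max(column) for column in zip(*window_probs)]


# A hypothetical 1,200-token function, represented by dummy token ids.
tokens = list(range(1200))
windows = sliding_windows(tokens)
print([len(w) for w in windows])

# Toy per-window probabilities for 3 labels (invented, one row per window).
pooled = max_pool_probs([[0.1, 0.2, 0.9], [0.4, 0.1, 0.3], [0.2, 0.7, 0.1], [0.1, 0.1, 0.1]])
print(pooled)
```

Max pooling favours recall, in line with the card's stated priority, but it also raises the false positive rate, because a single noisy window is enough to flag a label.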

 ## Interactive Demo

+Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer): paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.

 ## Citation

@@ -238,4 +225,4 @@ Try the model in our [Code Security Analyzer Space](https://huggingface.co/space
   year={2025},
   url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
 }
+```