ayshajavd committed on
Commit
82b783c
·
verified ·
1 Parent(s): 47c3141

Update model card with full test set evaluation metrics (5K samples)

Files changed (1)
  1. README.md +101 -114
README.md CHANGED
@@ -29,16 +29,20 @@ model-index:
29
  dataset:
30
  type: ayshajavd/code-security-vulnerability-dataset
31
  name: Code Security Vulnerability Dataset
 
32
  metrics:
33
  - type: f1
34
- value: 0.8779
35
  name: Weighted F1
36
  - type: f1
37
- value: 0.7043
38
  name: Micro F1
39
  - type: f1
40
- value: 0.1157
41
- name: Macro F1
 
42
  ---
43
 
44
  # GraphCodeBERT Vulnerability Classifier
@@ -76,7 +80,7 @@ TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94
76
  "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
77
  "CWE-787", "CWE-798", "CWE-918"]
78
 
79
- threshold = 0.3
80
  for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):
81
  if prob > threshold:
82
  print(f"{cwe}: {prob:.3f}")
@@ -86,12 +90,12 @@ for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):
86
 
87
  | Property | Value |
88
  |----------|-------|
89
- | **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden, 82M params) |
90
  | **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
91
  | **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
92
  | **Labels** | 31 (30 CWE categories + "safe") |
93
  | **Max Sequence Length** | 512 tokens |
94
- | **Detection Threshold** | 0.3 (optimized for recall: missing a vulnerability is worse than a false positive) |
95
 
96
  ## Supported Languages
97
 
@@ -99,82 +103,89 @@ Python, JavaScript, Java, C, C++, PHP, Go
99
 
100
  The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).
101
 
102
- ## Vulnerability Classes
103
-
104
- ### OWASP A01:2021 - Broken Access Control
105
- | CWE | Name | F1 Score |
106
- |-----|------|----------|
107
- | CWE-22 | Path Traversal | 0.000 |
108
- | CWE-200 | Information Exposure | 0.000 |
109
- | CWE-264 | Permissions/Privileges | 0.000 |
110
- | CWE-269 | Improper Privilege Management | 0.000 |
111
- | CWE-276 | Incorrect Default Permissions | 0.000 |
112
- | CWE-284 | Improper Access Control | 0.000 |
113
- | CWE-352 | CSRF | 0.000 |
114
- | CWE-601 | Open Redirect | 0.000 |
115
-
116
- ### OWASP A02:2021 - Cryptographic Failures
117
- | CWE | Name | F1 Score |
118
- |-----|------|----------|
119
- | CWE-310 | Cryptographic Issues | 0.000 |
120
- | CWE-327 | Broken Crypto Algorithm | 0.000 |
121
- | CWE-330 | Insufficient Randomness | 0.000 |
122
-
123
- ### OWASP A03:2021 - Injection
124
- | CWE | Name | F1 Score |
125
- |-----|------|----------|
126
- | CWE-20 | Improper Input Validation | 0.031 |
127
- | CWE-78 | OS Command Injection | 0.000 |
128
- | CWE-79 | Cross-Site Scripting (XSS) | 0.000 |
129
- | CWE-89 | SQL Injection | 0.600 |
130
- | CWE-94 | Code Injection | 0.435 |
131
- | CWE-119 | Buffer Overflow | 0.129 |
132
- | CWE-125 | Out-of-bounds Read | 0.133 |
133
- | CWE-190 | Integer Overflow | 0.400 |
134
- | CWE-401 | Memory Leak | 0.000 |
135
- | CWE-416 | Use After Free | 0.000 |
136
- | CWE-476 | NULL Pointer Dereference | 0.211 |
137
- | CWE-787 | Out-of-bounds Write | 0.233 |
138
-
139
- ### OWASP A04:2021 - Insecure Design
140
- | CWE | Name | F1 Score |
141
- |-----|------|----------|
142
- | CWE-362 | Race Condition | 0.000 |
143
- | CWE-399 | Resource Management Errors | 0.182 |
144
- | CWE-434 | Unrestricted File Upload | 0.000 |
145
-
146
- ### OWASP A07:2021 - Identification & Authentication Failures
147
- | CWE | Name | F1 Score |
148
- |-----|------|----------|
149
- | CWE-287 | Improper Authentication | 0.000 |
150
- | CWE-798 | Hardcoded Credentials | 0.000 |
151
-
152
- ### OWASP A08:2021 - Software & Data Integrity Failures
153
- | CWE | Name | F1 Score |
154
- |-----|------|----------|
155
- | CWE-502 | Insecure Deserialization | 0.286 |
156
-
157
- ### OWASP A10:2021 - Server-Side Request Forgery
158
- | CWE | Name | F1 Score |
159
- |-----|------|----------|
160
- | CWE-918 | SSRF | 0.000 |
161
-
162
- ### Overall Metrics
163
-
164
- | Metric | Value |
165
- |--------|-------|
166
- | **Weighted F1** | 0.878 |
167
- | **Micro F1** | 0.704 |
168
- | **Macro F1** | 0.116 |
169
- | **F1 (safe class)** | 0.946 |
170
- | **Macro Precision** | 0.087 |
171
- | **Macro Recall** | 0.276 |
172
-
173
- > **Note on Macro F1:** The low macro F1 is primarily due to extreme class imbalance: many CWE categories have <5 samples in the validation set, resulting in 0.0 F1 for those classes. The model performs well on classes with sufficient training data (SQL Injection: 0.60, Code Injection: 0.43, Integer Overflow: 0.40). Weighted F1 (0.878) better reflects real-world performance.
 
174
 
175
  ## Training Data
176
 
177
- The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples), a curated combination of:
178
 
179
  1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)**: 265K C/C++ vulnerable functions from real CVEs
180
  2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)**: Balanced CWE-labeled subset
@@ -185,49 +196,25 @@ The model was trained on the [code-security-vulnerability-dataset](https://huggi
185
 
186
  | Parameter | Value |
187
  |-----------|-------|
188
- | Epochs | 2 (initial) |
189
  | Batch Size | 8 |
190
  | Learning Rate | 5e-5 |
191
  | Scheduler | Cosine with warmup (50 steps) |
192
- | Loss | BCEWithLogitsLoss (class-weighted, clipped at 30x) |
193
- | Training Subset | 20K balanced samples (10K safe + 10K vulnerable) |
194
- | Validation Subset | 3K samples |
195
  | Optimizer | AdamW (fused) |
196
 
197
  ## Limitations
198
 
199
- 1. **Class imbalance**: Many rare CWE types have very few training examples. The model struggles with CWEs that have <50 training samples.
200
- 2. **Sequence length**: Limited to 512 tokens. Vulnerabilities spanning long functions may be missed.
201
- 3. **Language bias**: Strongest on C/C++ due to BigVul's dominance in training data. Performance on Go and PHP may be lower.
202
- 4. **Context-dependent vulns**: The model analyzes individual functions, not cross-function or cross-file vulnerabilities.
203
- 5. **False negatives**: The 0.3 threshold prioritizes sensitivity, but novel vulnerability patterns not seen in training may be missed.
204
- 6. **Not a replacement for manual review**: This model should complement, not replace, human security review and established SAST tools.
205
-
206
- ## Example Predictions
207
-
208
- ### SQL Injection (Python)
209
- ```python
210
- query = f"SELECT * FROM users WHERE username = '{username}'"
211
- cursor.execute(query)
212
- # → CWE-89: SQL Injection (confidence: 0.85)
213
- ```
214
-
215
- ### Buffer Overflow (C)
216
- ```c
217
- char buffer[64];
218
- strcpy(buffer, user_input);
219
- // → CWE-119: Buffer Overflow (confidence: 0.72)
220
- ```
221
-
222
- ### Safe Code
223
- ```python
224
- cursor.execute("SELECT * FROM users WHERE username = ?", (username,))
225
- # → safe (confidence: 0.94)
226
- ```
227
 
228
  ## Interactive Demo
229
 
230
- Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer): paste any code and get a full security report with OWASP mapping, severity scores, and suggested fixes.
231
 
232
  ## Citation
233
 
@@ -238,4 +225,4 @@ Try the model in our [Code Security Analyzer Space](https://huggingface.co/space
238
  year={2025},
239
  url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
240
  }
241
- ```
 
29
  dataset:
30
  type: ayshajavd/code-security-vulnerability-dataset
31
  name: Code Security Vulnerability Dataset
32
+ split: test
33
  metrics:
34
  - type: f1
35
+ value: 0.8648
36
  name: Weighted F1
37
  - type: f1
38
+ value: 0.4575
39
  name: Micro F1
40
  - type: f1
41
+ value: 0.9501
42
+ name: F1 (safe class)
43
+ - type: recall
44
+ value: 0.5018
45
+ name: Macro Recall
46
  ---
47
 
48
  # GraphCodeBERT Vulnerability Classifier
 
80
  "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
81
  "CWE-787", "CWE-798", "CWE-918"]
82
 
83
+ threshold = 0.5
84
  for i, (cwe, prob) in enumerate(zip(TARGET_CWES, probs)):
85
  if prob > threshold:
86
  print(f"{cwe}: {prob:.3f}")
 
90
 
91
  | Property | Value |
92
  |----------|-------|
93
+ | **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden, 83.5M params) |
94
  | **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
95
  | **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
96
  | **Labels** | 31 (30 CWE categories + "safe") |
97
  | **Max Sequence Length** | 512 tokens |
98
+ | **Recommended Threshold** | 0.5 (balanced precision/recall) or 0.3 (high recall, security-first) |
99
 
100
  ## Supported Languages
101
 
 
103
 
104
  The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).
105
 
106
+ ## Evaluation Results (Test Set, 5,000 samples)
107
+
108
+ ### Threshold Comparison
109
+
110
+ | Threshold | Macro F1 | Micro F1 | Weighted F1 | Macro Precision | Macro Recall |
111
+ |-----------|----------|----------|-------------|-----------------|--------------|
112
+ | 0.2 | 0.066 | 0.301 | 0.859 | 0.048 | 0.562 |
113
+ | **0.3** | **0.081** | **0.458** | **0.865** | **0.057** | **0.502** |
114
+ | 0.4 | 0.101 | 0.626 | 0.870 | 0.070 | 0.439 |
115
+ | **0.5** | **0.125** | **0.739** | **0.870** | **0.088** | **0.366** |
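
A threshold comparison like the one above can be approximated with a small sweep over per-label probabilities. The sketch below is a minimal, dependency-free illustration and not the evaluation script used for this card: the helper names (`binarize`, `f1_summary`, `sweep_thresholds`) and the toy data are hypothetical, and it assumes that a class with no positives and no predictions scores an F1 of 0, which the card does not confirm.

```python
from typing import Dict, List, Sequence


def binarize(probs: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Turn per-label probabilities into 0/1 predictions (strictly above threshold)."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


def f1_summary(y_true: Sequence[Sequence[int]],
               y_pred: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Compute macro, micro, and support-weighted F1 for multi-label data."""
    n_labels = len(y_true[0])
    per_class_f1: List[float] = []
    supports: List[int] = []
    tp_all = fp_all = fn_all = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumption: an undefined F1 (no positives, no predictions) counts as 0.
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro_denom = 2 * tp_all + fp_all + fn_all
    total_support = sum(supports)
    return {
        "macro": sum(per_class_f1) / n_labels,
        "micro": 2 * tp_all / micro_denom if micro_denom else 0.0,
        "weighted": (sum(f * s for f, s in zip(per_class_f1, supports)) / total_support
                     if total_support else 0.0),
    }


def sweep_thresholds(y_true, probs, thresholds) -> Dict[float, Dict[str, float]]:
    """Evaluate the same probabilities at several decision thresholds."""
    return {t: f1_summary(y_true, binarize(probs, t)) for t in thresholds}


# Toy data (hypothetical): 4 samples, 2 labels.
toy_true = [[1, 0], [0, 1], [1, 0], [1, 0]]
toy_probs = [[0.9, 0.4], [0.2, 0.6], [0.8, 0.35], [0.45, 0.1]]
results = sweep_thresholds(toy_true, toy_probs, [0.3, 0.5])
for threshold, scores in results.items():
    print(threshold, {name: round(value, 3) for name, value in scores.items()})
```

On the toy data, raising the threshold from 0.3 to 0.5 removes two false positives on the second label at the cost of one missed positive on the first, which is the same precision/recall trade-off the table reports.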
116
+
117
+ ### Per-Class Performance (threshold=0.3)
118
+
119
+ #### OWASP A01:2021 - Broken Access Control
120
+ | CWE | Name | Support | Precision | Recall | F1 |
121
+ |-----|------|---------|-----------|--------|-----|
122
+ | CWE-22 | Path Traversal | 2 | 0.000 | 0.000 | 0.000 |
123
+ | CWE-200 | Information Exposure | 30 | 0.063 | 0.800 | 0.117 |
124
+ | CWE-264 | Permissions/Privileges | 23 | 0.025 | 0.696 | 0.049 |
125
+ | CWE-269 | Improper Privilege Mgmt | 1 | 0.000 | 0.000 | 0.000 |
126
+ | CWE-276 | Incorrect Permissions | 0 | n/a | n/a | n/a |
127
+ | CWE-284 | Access Control | 5 | 0.000 | 0.000 | 0.000 |
128
+ | CWE-352 | CSRF | 1 | 0.000 | 0.000 | 0.000 |
129
+ | CWE-601 | Open Redirect | 0 | n/a | n/a | n/a |
130
+
131
+ #### OWASP A02:2021 - Cryptographic Failures
132
+ | CWE | Name | Support | Precision | Recall | F1 |
133
+ |-----|------|---------|-----------|--------|-----|
134
+ | CWE-310 | Cryptographic Issues | 5 | 0.000 | 0.000 | 0.000 |
135
+ | CWE-327 | Broken Crypto Algorithm | 1 | 0.000 | 0.000 | 0.000 |
136
+ | CWE-330 | Insufficient Randomness | 1 | 0.000 | 0.000 | 0.000 |
137
+
138
+ #### OWASP A03:2021 - Injection
139
+ | CWE | Name | Support | Precision | Recall | F1 |
140
+ |-----|------|---------|-----------|--------|-----|
141
+ | CWE-20 | Input Validation | 69 | 0.023 | **0.957** | 0.046 |
142
+ | CWE-78 | Command Injection | 1 | 0.011 | **1.000** | 0.021 |
143
+ | CWE-79 | XSS | 16 | 0.084 | **0.750** | 0.151 |
144
+ | CWE-89 | SQL Injection | 15 | 0.096 | **1.000** | 0.174 |
145
+ | CWE-94 | Code Injection | 27 | 0.123 | **1.000** | 0.220 |
146
+ | CWE-119 | Buffer Overflow | 118 | 0.088 | **0.898** | 0.160 |
147
+ | CWE-125 | Out-of-bounds Read | 35 | 0.048 | **0.829** | 0.091 |
148
+ | CWE-190 | Integer Overflow | 14 | 0.033 | **1.000** | 0.064 |
149
+ | CWE-401 | Memory Leak | 2 | 0.022 | **1.000** | 0.044 |
150
+ | CWE-416 | Use After Free | 20 | 0.048 | 0.400 | 0.086 |
151
+ | CWE-476 | NULL Pointer Deref | 30 | 0.032 | **0.867** | 0.061 |
152
+ | CWE-787 | Out-of-bounds Write | 46 | 0.052 | **0.891** | 0.099 |
153
+
154
+ #### OWASP A04:2021 - Insecure Design
155
+ | CWE | Name | Support | Precision | Recall | F1 |
156
+ |-----|------|---------|-----------|--------|-----|
157
+ | CWE-362 | Race Condition | 11 | 0.035 | 0.636 | 0.065 |
158
+ | CWE-399 | Resource Management | 21 | 0.008 | **0.857** | 0.015 |
159
+ | CWE-434 | File Upload | 0 | n/a | n/a | n/a |
160
+
161
+ #### OWASP A07–A10
162
+ | CWE | Name | Support | Precision | Recall | F1 |
163
+ |-----|------|---------|-----------|--------|-----|
164
+ | CWE-287 | Authentication | 0 | n/a | n/a | n/a |
165
+ | CWE-798 | Hardcoded Credentials | 0 | n/a | n/a | n/a |
166
+ | CWE-502 | Deserialization | 10 | 0.056 | **1.000** | 0.106 |
167
+ | CWE-918 | SSRF | 0 | n/a | n/a | n/a |
168
+
169
+ ### Key Metric: Safe Code Detection
170
+ | Class | Support | Precision | Recall | F1 |
171
+ |-------|---------|-----------|--------|-----|
172
+ | **safe** | **4,496** | **0.927** | **0.975** | **0.950** |
173
+
174
+ ### Model Strengths
175
+ - **Excellent recall** on many vulnerability classes (0.75–1.0 for SQL injection, buffer overflow, XSS, code injection, etc.)
176
+ - **Strong safe code detection** (F1=0.95): reliably identifies secure code
177
+ - **High sensitivity**: at threshold 0.3, recall is 0.75 or higher on 12 of the 15 CWE classes with at least 10 test samples (macro recall=0.50 across all classes)
178
+
179
+ ### Model Limitations
180
+ - **Low precision on rare classes**: many false positives, especially on CWEs with few training examples
181
+ - Precision can be improved by using **threshold=0.5** (macro F1 improves to 0.125 but recall drops)
182
+ - Classes with 0 test support cannot be evaluated
183
+
184
+ > **Design choice:** For security applications, we prioritize recall (catching real vulnerabilities) over precision (reducing false positives). Missing a real vulnerability (false negative) is worse than flagging safe code (false positive).
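
To make the cost of this design choice concrete, the reported precision, recall, and support can be turned back into approximate true-positive and false-positive counts. The sketch below is illustrative only: `f1_from` and `implied_counts` are hypothetical helper names, and because the card rounds its metrics to three decimals, the derived counts are estimates rather than the exact confusion-matrix values.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_counts(support: int, precision: float, recall: float):
    """Estimate (true positives, false positives) from rounded summary metrics.

    recall    = TP / support         ->  TP = recall * support
    precision = TP / (TP + FP)       ->  FP = TP * (1 - precision) / precision
    """
    if precision <= 0:
        raise ValueError("false positives cannot be recovered when precision is 0")
    true_positives = recall * support
    false_positives = true_positives * (1 - precision) / precision
    return true_positives, false_positives


# CWE-89 row at threshold 0.3: support=15, precision=0.096, recall=1.000.
tp, fp = implied_counts(15, 0.096, 1.000)
print(f"CWE-89: ~{tp:.0f} true positives, ~{fp:.0f} false positives")

# The safe-class row (precision=0.927, recall=0.975) gives back its reported F1.
print(f"safe F1: {f1_from(0.927, 0.975):.3f}")
```

Read this way, the CWE-89 row implies that all 15 SQL injection samples were caught, at the cost of roughly 140 other samples in the 5,000-sample test set also being flagged as CWE-89.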
185
 
186
  ## Training Data
187
 
188
+ The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples), combining:
189
 
190
  1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)**: 265K C/C++ vulnerable functions from real CVEs
191
  2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)**: Balanced CWE-labeled subset
 
196
 
197
  | Parameter | Value |
198
  |-----------|-------|
199
+ | Epochs | 2 |
200
  | Batch Size | 8 |
201
  | Learning Rate | 5e-5 |
202
  | Scheduler | Cosine with warmup (50 steps) |
203
+ | Loss | BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x) |
204
+ | Training Subset | 20K balanced samples |
 
205
  | Optimizer | AdamW (fused) |
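
The "pos_weight clipped to 30x" setting in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: the card does not say how the class weights were derived, so the common negatives-to-positives ratio is assumed here, and `clipped_pos_weights` and the example counts are hypothetical.

```python
from typing import List, Sequence


def clipped_pos_weights(positive_counts: Sequence[int],
                        n_samples: int,
                        cap: float = 30.0) -> List[float]:
    """Per-label positive-class weights, capped so rare labels cannot dominate the loss.

    Assumption: weight = (#negatives / #positives), clipped to `cap`.
    Labels with no positive examples simply receive the cap.
    """
    weights = []
    for positives in positive_counts:
        if positives == 0:
            weights.append(cap)
            continue
        negatives = n_samples - positives
        weights.append(min(negatives / positives, cap))
    return weights


# Hypothetical label counts in a 20,000-sample training subset.
example_weights = clipped_pos_weights([10_000, 1_000, 20], n_samples=20_000)
print(example_weights)
```

In PyTorch, a list like this would typically be converted to a tensor and passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. Without the cap, the 20-sample label in the example would get a weight of 999, which tends to push the model toward over-predicting rare classes.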
206
 
207
  ## Limitations
208
 
209
+ 1. **Class imbalance**: Many rare CWE types have very few training examples, leading to high false positive rates
210
+ 2. **Sequence length**: Limited to 512 tokens; long functions may be truncated
211
+ 3. **Language bias**: Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
212
+ 4. **Single-function analysis**: Analyzes individual functions, not cross-function or cross-file vulnerabilities
213
+ 5. **Not a replacement**: Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
 
214
 
215
  ## Interactive Demo
216
 
217
+ Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer): paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.
218
 
219
  ## Citation
220
 
 
225
  year={2025},
226
  url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
227
  }
228
+ ```