aidencary committed on
Commit
ad86b43
·
verified ·
1 Parent(s): 26acd1b

Update README.md

  1. README.md +190 -1
README.md CHANGED
@@ -4,4 +4,193 @@ language:
4
  - en
5
  base_model:
6
  - microsoft/codebert-base
7
- ---
7
+ pipeline_tag: text-classification
8
+ tags:
9
+ - code-quality
10
+ - bug-detection
11
+ - codebert
12
+ - python
13
+ ---
18
+ # codepulse-codebert
19
+
20
+ Fine-tuned binary classifier on top of `microsoft/codebert-base` that
21
+ scores code snippets by P(buggy). Used in the CodePulse analysis engine
22
+ as a confidence validator: it filters GPT-predicted bugs by checking
23
+ whether the flagged line is statistically likely to be buggy, reducing
24
+ false positives before they reach the end user.
25
+
26
+ ## Model Details
27
+
28
+ ### Model Description
29
+
30
+ CodePulse-CodeBERT is a binary sequence classifier fine-tuned from
31
+ `microsoft/codebert-base`. Given a short code snippet (typically one bug
32
+ line plus optional surrounding context), the model outputs a probability
33
+ that the snippet contains a bug. Predictions below a configurable
34
+ threshold are marked as low-confidence and excluded from the final
35
+ quality score.
36
+
37
+ - **Developed by:** Aiden Cary, Keller Willhite, Zachery Atchley
38
+ - **Model type:** Transformer-based binary sequence classifier
39
+ (CodeBERT fine-tune)
40
+ - **Language(s) (NLP):** Code (Python primary)
41
+ - **License:** MIT
42
+ - **Finetuned from model:**
43
+ [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
44
+
45
+ ### Model Sources
46
+
47
+ - **Repository:** https://github.com/aidencary/CodePulse
48
+
49
+ ## Uses
50
+
51
+ ### Direct Use
52
+
53
+ Classify short code snippets as buggy or not buggy:
54
+
55
+ ``` python
56
+ from transformers import pipeline
57
+
58
+ clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
59
+ result = clf("return user_list[index]")
60
+ # e.g. [{'label': 'buggy', 'score': 0.87}] (illustrative; actual scores vary)
61
+ ```
62
+
63
+ ### Downstream Use
64
+
65
+ Integrated into the CodePulse backend
66
+ (`app/services/codebert_validator.py`) as a post-processing layer over
67
+ GPT-generated bug predictions. Each predicted bug line is extracted,
68
+ comment-stripped, and scored. Bugs whose P(buggy) falls below the
69
+ configured threshold are flagged and excluded from the penalty applied
70
+ to the code quality score.
71
+
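A minimal sketch of this post-processing step, assuming a hypothetical `BugPrediction` record and a `score_fn` callable that stands in for the model. Neither name is taken from `codebert_validator.py`, and the 0.5 default threshold is a placeholder rather than the project's configured value.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class BugPrediction:
    """Hypothetical record for one GPT-predicted bug (not the real CodePulse type)."""
    line: str                     # flagged source line, already comment-stripped
    p_buggy: float = 0.0          # filled in by the validator
    low_confidence: bool = False  # True when p_buggy falls below the threshold


def validate_predictions(
    predictions: List[BugPrediction],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed default; the real value is configurable
) -> Tuple[List[BugPrediction], List[BugPrediction]]:
    """Score each prediction and split the list into (kept, flagged)."""
    kept, flagged = [], []
    for pred in predictions:
        pred.p_buggy = score_fn(pred.line)
        pred.low_confidence = pred.p_buggy < threshold
        (flagged if pred.low_confidence else kept).append(pred)
    return kept, flagged


# Toy scorer standing in for the model so the sketch runs offline.
fake_scores = {"items[i] = value": 0.82, "x = 1": 0.12}
kept, flagged = validate_predictions(
    [BugPrediction("items[i] = value"), BugPrediction("x = 1")],
    score_fn=fake_scores.__getitem__,
)
print([p.line for p in kept])     # ['items[i] = value']
print([p.line for p in flagged])  # ['x = 1']
```

Only the `kept` list would then contribute to the quality-score penalty; the flagged entries stay available for display as low-confidence findings.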
72
+ ### Out-of-Scope Use
73
+
74
+ - Full-file classification: the model expects single-line or
75
+ short-window snippets (≤512 tokens); longer inputs are truncated.
76
+ - Languages other than Python: the training data was Python-focused,
77
+ so results on other languages are unreliable.
78
+ - Security vulnerability detection: the model was trained on general
79
+ bug patterns, not security-specific flaws (SQLi, XSS, etc.).
80
+ - Production safety gate without human review: the false-negative
81
+ rate is non-zero.
82
+
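Because the model expects a short window rather than a whole file, a caller may want to cut a snippet around the flagged line before scoring. A minimal sketch, assuming a hypothetical helper `extract_window` and a 1-based line number; the context size is an illustrative choice, not a documented CodePulse setting.

```python
def extract_window(source: str, line_no: int, context: int = 2) -> str:
    """Return the flagged line plus up to `context` lines on each side.

    `line_no` is 1-based, as in most editor and linter output.
    """
    lines = source.splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line_no {line_no} is outside 1..{len(lines)}")
    start = max(0, line_no - 1 - context)     # clamp at the top of the file
    end = min(len(lines), line_no + context)  # clamp at the bottom of the file
    return "\n".join(lines[start:end])


source = "a = 1\nb = 2\nc = items[i]\nd = 4\ne = 5"
print(extract_window(source, line_no=3, context=1))
# b = 2
# c = items[i]
# d = 4
```

The window is still subject to the 512-token limit, so a small `context` keeps truncation from cutting off the flagged line.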
83
+ ## Bias, Risks, and Limitations
84
+
85
+ - Training data skews toward certain bug patterns; rare bug types will
86
+ have lower recall.
87
+ - Comment stripping is applied at inference time (inline `# ...`
88
+ comments are removed before scoring) to prevent label leakage from
89
+ annotated datasets. Code with semantically meaningful comments may
90
+ lose signal.
91
+ - Confidence contrast remapping is applied in the CodePulse pipeline:
92
+ raw model probabilities are spread apart via a sigmoid transform
93
+ before thresholding. Callers who use the model directly, outside that
94
+ pipeline, get unmodified softmax probabilities.
95
+
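One way such a contrast remap could look is sketched below. The functional form, the `steepness` value, and the `midpoint` are all assumptions for illustration; the actual transform used in the CodePulse pipeline is not specified in this card.

```python
import math


def contrast_remap(p: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Push probabilities away from `midpoint` with a logistic curve.

    Hypothetical parameters: larger `steepness` spreads values further apart.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (p - midpoint)))


for raw in (0.3, 0.5, 0.7):
    print(f"raw={raw:.2f} -> remapped={contrast_remap(raw):.3f}")
```

With these assumed settings, values above the midpoint move toward 1 and values below it move toward 0, while the ordering of scores is preserved. A threshold tuned on remapped scores therefore does not carry over to raw softmax outputs.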
96
+ ## Recommendations
97
+
98
+ Use P(buggy) as a soft signal, not a hard gate. Combine with static
99
+ analysis or human review for critical codepaths.
100
+
101
+ ## How to Get Started with the Model
102
+
103
+ ``` python
104
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
105
+ import torch
106
+ import torch.nn.functional as F
107
+
108
+ tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
109
+ model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
110
+ model.eval()
111
+
112
+ snippet = "items[i] = value"
113
+ inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
114
+ with torch.no_grad():
115
+     logits = model(**inputs).logits
116
+ p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
117
+ print(f"P(buggy): {p_buggy:.3f}")
118
+ ```
119
+
120
+ ## Training Details
121
+
122
+ ### Training Data
123
+
124
+ Fine-tuned on labeled code snippets, where each sample is a short code
125
+ line or block annotated as buggy or clean. Training data was sourced from
126
+ public bug datasets and synthetic bug injection into clean Python code.
127
+
128
+ ### Training Procedure
129
+
130
+ #### Preprocessing
131
+
132
+ - Inline `#` comments stripped to prevent label leakage
133
+ - Common leading indentation removed (dedented to column 0)
134
+ - Tokenized with the `microsoft/codebert-base` tokenizer, max length 512
135
+
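The first two preprocessing steps can be sketched as follows. The helper names are hypothetical, the scanner only understands single-line string literals (triple-quoted strings are not handled), and dropping lines that become empty after stripping is an assumption rather than documented behavior.

```python
import textwrap


def strip_inline_comment(line: str) -> str:
    """Remove a trailing `# ...` comment, ignoring `#` inside simple string literals."""
    quote = None  # the quote character currently open, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\":
                i += 2  # skip the escaped character inside a string
                continue
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == "#":
            return line[:i].rstrip()
        i += 1
    return line.rstrip()


def preprocess(snippet: str) -> str:
    """Strip inline comments, dedent to column 0, and drop empty lines."""
    stripped = "\n".join(strip_inline_comment(l) for l in snippet.splitlines())
    dedented = textwrap.dedent(stripped)
    return "\n".join(l for l in dedented.splitlines() if l.strip())


raw = "    total = 0  # running sum\n    tag = '#1'\n"
print(preprocess(raw))
# total = 0
# tag = '#1'
```

The cleaned string would then be passed to the tokenizer with `truncation=True, max_length=512`, as in the getting-started example.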
136
+ #### Training Hyperparameters
137
+
138
+ - Training regime: fp32
139
+ - Base model: microsoft/codebert-base
140
+ - Task head: AutoModelForSequenceClassification (2 labels)
141
+
142
+ ## Evaluation
143
+
144
+ ### Testing Data, Factors & Metrics
145
+
146
+ #### Testing Data
147
+
148
+ Held-out split from the same labeled snippet dataset used for training.
149
+
150
+ #### Metrics
151
+
152
+ - Accuracy
153
+ - F1 (macro)
154
+ - P(buggy) calibration: model confidence should correlate with the
155
+ actual bug rate
156
+
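A dependency-free sketch of how these three metrics could be computed. It assumes labels are encoded as 1 = buggy and 0 = clean (the real mapping lives in `model.config.label2id`) and uses equal-width probability bins as one possible calibration check; the card does not state which calibration measure the authors used.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def calibration_table(
    y_true: Sequence[int], p_buggy: Sequence[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Per non-empty bin: (mean predicted P(buggy), observed bug rate, count)."""
    sums, hits, counts = [0.0] * n_bins, [0] * n_bins, [0] * n_bins
    for t, p in zip(y_true, p_buggy):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b] += p
        hits[b] += t
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b]
    ]


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(accuracy(y_true, y_pred))            # 0.75
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

In a well-calibrated model, the first two values in each row of `calibration_table` are close to each other.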
157
+ #### Results
158
+
159
+ | Metric     | Value       |
160
+ |------------|-------------|
161
+ | Accuracy   | [add yours] |
162
+ | F1 (macro) | [add yours] |
163
+
164
+ ### Summary
165
+
166
+ The model performs best on Python snippets that match the training
167
+ distribution. Performance degrades on heavily commented code (comments
168
+ are stripped at inference) and on languages outside the training set.
169
+
170
+ ## Technical Specifications
171
+
172
+ ### Model Architecture and Objective
173
+
174
+ `RobertaForSequenceClassification` (CodeBERT backbone) with a 2-class
175
+ classification head. Objective: cross-entropy over two classes,
176
+ labels = {clean, buggy}.
177
+
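With a 2-class head, P(buggy) is the softmax over two logits, which is mathematically the same as a sigmoid applied to the difference between the two logits. The short check below illustrates that identity; the function names and logit values are made up for the example.

```python
import math


def p_buggy_softmax(logit_clean: float, logit_buggy: float) -> float:
    """Softmax probability of the 'buggy' class from two logits."""
    m = max(logit_clean, logit_buggy)  # subtract the max for numerical stability
    e_clean = math.exp(logit_clean - m)
    e_buggy = math.exp(logit_buggy - m)
    return e_buggy / (e_clean + e_buggy)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


logit_clean, logit_buggy = 0.3, 1.9  # arbitrary example logits
a = p_buggy_softmax(logit_clean, logit_buggy)
b = sigmoid(logit_buggy - logit_clean)
print(abs(a - b) < 1e-12)  # True
```

This is why a two-class cross-entropy loss and a binary cross-entropy loss on the logit difference describe the same objective for this model.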
178
+ ### Compute Infrastructure
179
+
180
+ #### Hardware
181
+
182
+ Consumer GPU (training)
183
+
184
+ #### Software
185
+
186
+ - transformers
187
+ - torch
188
+ - Python 3.11+
189
+
190
+ ## Model Card Authors
191
+
192
+ Aiden Cary, Keller Willhite, Zachery Atchley
193
+
194
+ ## Model Card Contact
195
+
196
+ aiden4786@gmail.com