scthornton committed on
Commit
23afc6c
·
verified ·
1 Parent(s): 832b49f

Upload README.md with huggingface_hub

  1. README.md +132 -629
README.md CHANGED
@@ -1,704 +1,207 @@
1
  ---
2
- license: apache-2.0
3
  base_model: google/codegemma-7b-it
4
  tags:
5
- - code
6
- - security
7
- - codegemma
8
- - google
9
- - securecode
10
- - owasp
11
- - vulnerability-detection
 
 
 
12
  datasets:
13
- - scthornton/securecode-v2
14
- language:
15
- - en
16
- library_name: transformers
17
  pipeline_tag: text-generation
18
- arxiv: 2512.18542
 
 
19
  ---
20
 
21
- # CodeGemma 7B - SecureCode Edition
22
 
23
  <div align="center">
24
 
25
- [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
26
- [![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
27
- [![Base Model](https://img.shields.io/badge/base-CodeGemma%207B-orange.svg)](https://huggingface.co/google/codegemma-7b-it)
28
- [![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)
29
-
30
- **🔷 Google's code model enhanced with security expertise**
31
 
32
- Exceptional instruction following meets security awareness. Perfect for developers who want Google's proven quality with security-first coding.
33
 
34
- [📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Hub](https://huggingface.co/scthornton/codegemma-7b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai) | [📚 Collection](https://huggingface.co/collections/scthornton/securecode)
35
 
36
  </div>
37
 
38
  ---
39
 
40
- ## 🎯 Quick Decision Guide
41
 
42
- **Choose This Model If:**
43
- - ✅ You value **Google brand trust** and proven quality
44
- - ✅ You need **excellent instruction following** for complex security tasks
45
- - ✅ You want **strong code completion** with security awareness
46
- - ✅ You're building on **Google Cloud Platform** or Google ecosystem
47
- - ✅ You need **reliable, consistent responses** from a proven architecture
48
- - ✅ You prefer **7B efficiency** with Google's engineering quality
49
 
50
- **Consider Other Models If:**
51
- - ⚠️ You need maximum context window (→ Qwen 7B/14B with 128K)
52
- - ⚠️ You're on very limited hardware (→ Llama 3B)
53
- - ⚠️ You need enterprise brand diversity (→ IBM Granite, Meta CodeLlama)
54
- - ⚠️ You want absolute best code understanding (→ Qwen 7B slightly edges out)
55
 
56
- ---
57
-
58
- ## 📊 Collection Positioning
59
 
60
- | Model | Size | Best For | Hardware | Inference Speed | Unique Strength |
61
- |-------|------|----------|----------|-----------------|-----------------|
62
- | Llama 3.2 3B | 3B | Consumer deployment | 8GB RAM | ⚡⚡⚡ Fastest | Most accessible |
63
- | DeepSeek 6.7B | 6.7B | Security-optimized baseline | 16GB RAM | ⚡⚡ Fast | Security architecture |
64
- | Qwen 7B | 7B | Best code understanding | 16GB RAM | ⚡⚡ Fast | Best-in-class 7B |
65
- | **CodeGemma 7B** | **7B** | **Google ecosystem** | **16GB RAM** | **⚡⚡ Fast** | **Instruction following, Google quality** |
66
- | CodeLlama 13B | 13B | Enterprise trust | 24GB RAM | ⚡ Medium | Meta brand, proven |
67
- | Qwen 14B | 14B | Advanced analysis | 32GB RAM | ⚡ Medium | 128K context window |
68
- | StarCoder2 15B | 15B | Multi-language specialist | 32GB RAM | ⚡ Medium | 600+ languages |
69
- | Granite 20B | 20B | Enterprise-scale | 48GB RAM | ⚡ Medium | IBM trust, largest |
70
 
71
- **This Model's Sweet Spot:** Google quality + security expertise. Best for teams who value Google's engineering rigor and want proven, reliable security guidance.
72
 
73
- ---
74
 
75
- ## 🚨 The Problem This Solves
76
-
77
- **AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). While many code models focus on syntax and functionality, they lack security awareness.
78
-
79
- **Real-world costs:**
80
- - **Equifax** (SQL injection): $425 million settlement + brand destruction
81
- - **Capital One** (SSRF): 100 million customer records, $80M fine
82
- - **SolarWinds** (authentication bypass): 18,000 organizations compromised
83
- - **LastPass** (cryptographic failures): 30 million users affected
84
-
85
- CodeGemma SecureCode Edition brings Google's renowned engineering quality to secure coding, combining reliable instruction following with comprehensive security knowledge.
86
-
87
- ---
88
-
89
- ## 💡 What is This?
90
-
91
- This is **Google CodeGemma 7B Instruct** fine-tuned on the **SecureCode v2.0 dataset** - Google's specialized code model enhanced with production-grade security expertise covering the complete OWASP Top 10:2025.
92
-
93
- CodeGemma is part of Google's Gemma family, built on the same technology powering Google's AI products. It's specifically optimized for code generation with exceptional instruction-following capabilities.
94
-
95
- Combined with SecureCode training, this model delivers:
96
-
97
- ✅ **Excellent instruction following** - Reliably follows complex security requirements
98
- ✅ **Google engineering quality** - Proven architecture from Google AI
99
- ✅ **Strong code completion** - Exceptional at completing partial secure code
100
- ✅ **Consistent, reliable responses** - Predictable behavior for production use
101
- ✅ **Security-first code generation** - Trained on real vulnerability patterns
102
-
103
- **The Result:** A code assistant that combines Google's quality with security expertise.
104
-
105
- **Why CodeGemma 7B?** This model offers Google's advantages:
106
- - 🔷 **Google brand trust** - Built by the team behind TensorFlow, BERT, and PaLM
107
- - 🎯 **Instruction-following excellence** - Consistently follows complex security specifications
108
- - ⚡ **Production efficiency** - 7B parameters = fast inference
109
- - 🌍 **Broad language support** - Code generation across major languages
110
- - 🏢 **GCP integration** - Optimized for Google Cloud Platform deployment
111
- - ⚖️ **Apache 2.0 licensed** - Full commercial freedom
112
-
113
- Perfect for development teams using Google Cloud, organizations valuing Google's engineering culture, and developers who prioritize instruction-following reliability.
114
-
115
- ---
116
-
117
- ## 🔐 Security Training Coverage
118
-
119
- ### Real-World Vulnerability Distribution
120
-
121
- Trained on 1,209 security examples with real CVE grounding:
122
-
123
- | OWASP Category | Examples | Real Incidents |
124
- |----------------|----------|----------------|
125
- | **Broken Access Control** | 224 | Equifax, Facebook, Uber |
126
- | **Authentication Failures** | 199 | SolarWinds, Okta, LastPass |
127
- | **Injection Attacks** | 125 | Capital One, Yahoo, LinkedIn |
128
- | **Cryptographic Failures** | 115 | LastPass, Adobe, Dropbox |
129
- | **Security Misconfiguration** | 98 | Tesla, MongoDB, Elasticsearch |
130
- | **Vulnerable Components** | 87 | Log4Shell, Heartbleed, Struts |
131
- | **Identification/Auth Failures** | 84 | Twitter, GitHub, Reddit |
132
- | **Software/Data Integrity** | 78 | SolarWinds, Codecov, npm |
133
- | **Logging Failures** | 71 | Various incident responses |
134
- | **SSRF** | 69 | Capital One, Shopify |
135
- | **Insecure Design** | 59 | Architectural flaws |
136
-
137
- ### Multi-Language Support
138
-
139
- Fine-tuned on security examples across:
140
- - **Python** (Django, Flask, FastAPI) - 280 examples
141
- - **JavaScript/TypeScript** (Express, NestJS, React) - 245 examples
142
- - **Java** (Spring Boot) - 178 examples
143
- - **Go** (Gin framework) - 145 examples
144
- - **PHP** (Laravel, Symfony) - 112 examples
145
- - **C#** (ASP.NET Core) - 89 examples
146
- - **Ruby** (Rails) - 67 examples
147
- - **Rust** (Actix, Rocket) - 45 examples
148
- - **C/C++** (Memory safety) - 28 examples
149
- - **Kotlin, Swift** - 20 examples
150
-
151
- ---
152
-
153
- ## 🎯 Deployment Scenarios
154
-
155
- ### Scenario 1: Google Cloud Platform Integration
156
-
157
- **Native integration with GCP services.**
158
-
159
- **Platform:** Google Cloud Run, Vertex AI, GKE
160
- **Hardware:** Cloud TPU, NVIDIA T4/A100
161
- **Use Case:** Serverless security code generation
162
-
163
- **GCP Benefits:**
164
- - Optimized for Google Cloud infrastructure
165
- - Seamless Vertex AI integration
166
- - Cloud Run auto-scaling
167
- - Integrated monitoring and logging
168
-
169
- **ROI:** Reduced deployment complexity on GCP. Natural fit for Google-first organizations.
170
-
171
- ---
172
-
173
- ### Scenario 2: Secure API Code Generation
174
-
175
- **Generate production-ready secure APIs with precise specifications.**
176
-
177
- **Hardware:** Standard cloud instance (16GB RAM)
178
- **Use Case:** API security automation
179
- **Strength:** Follows detailed security requirements precisely
180
-
181
- **Example Use Case:**
182
- ```
183
- Generate a secure REST API for user authentication with:
184
- - JWT tokens (RS256)
185
- - Refresh token rotation
186
- - Rate limiting (10 req/min per IP)
187
- - Comprehensive audit logging
188
- - CSRF protection
189
- ```
190
-
191
- **Instruction Following:** CodeGemma reliably implements ALL specified requirements, not just some.
192
-
193
- ---
194
-
195
- ### Scenario 3: Code Review Copilot
196
-
197
- **Real-time security suggestions during code review.**
198
-
199
- **Platform:** GitHub Copilot alternative, IDE plugins
200
- **Latency:** <100ms for inline suggestions
201
- **Use Case:** Security-aware code completion
202
-
203
- **Value Proposition:**
204
- - Suggests secure patterns as developers type
205
- - Catches vulnerabilities during development
206
- - Educates developers on security best practices
207
- - Reduces security debt accumulation
208
-
209
- ---
210
-
211
- ### Scenario 4: Educational Platform
212
-
213
- **Teaching secure coding with Google-quality foundations.**
214
-
215
- **Audience:** CS students, bootcamp students, junior developers
216
- **Platform:** Interactive coding platforms
217
- **Use Case:** Security education at scale
218
-
219
- **Educational Benefits:**
220
- - Google brand credibility for students
221
- - Consistent, predictable teaching responses
222
- - Clear explanations of security concepts
223
- - Reliable code examples
224
-
225
- ---
226
-
227
- ## 📊 Training Details
228
-
229
- | Parameter | Value | Why This Matters |
230
- |-----------|-------|------------------|
231
- | **Base Model** | google/codegemma-7b-it | Google's instruction-tuned code model |
232
- | **Fine-tuning Method** | LoRA (Low-Rank Adaptation) | Efficient training, preserves base capabilities |
233
- | **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) | 100% incident-grounded, expert-validated |
234
- | **Dataset Size** | 841 training examples | Focused on quality over quantity |
235
- | **Training Epochs** | 3 | Optimal convergence without overfitting |
236
- | **LoRA Rank (r)** | 16 | Balanced parameter efficiency |
237
- | **LoRA Alpha** | 32 | Learning rate scaling factor |
238
- | **Learning Rate** | 2e-4 | Standard for LoRA fine-tuning |
239
- | **Quantization** | 4-bit (bitsandbytes) | Enables efficient training |
240
- | **Trainable Parameters** | ~40M (0.57% of 7B total) | Minimal parameters, maximum impact |
241
- | **Total Parameters** | 7B | Sweet spot for efficiency |
242
- | **Context Window** | 8K tokens | Standard for code analysis |
243
- | **GPU Used** | NVIDIA A100 40GB | Enterprise training infrastructure |
244
- | **Training Time** | ~6 hours (estimated) | Efficient training cycle |
245
-
246
- ### Training Methodology
247
-
248
- **LoRA (Low-Rank Adaptation)** preserves CodeGemma's instruction-following capabilities:
249
- 1. **Efficiency:** Trains only 0.57% of model parameters (40M vs 7B)
250
- 2. **Quality:** Maintains Google's exceptional code generation
251
- 3. **Reliability:** Preserves consistent, predictable behavior
252
-
253
- **Google Gemma Foundation:** Built on Google's cutting-edge AI research:
254
- - State-of-the-art instruction following
255
- - Optimized for code generation tasks
256
- - Proven reliability in production
257
- - Backed by Google AI engineering
258
-
259
- ---
260
-
261
- ## 🚀 Usage
262
-
263
- ### Quick Start
264
-
265
- ```python
266
- from transformers import AutoModelForCausalLM, AutoTokenizer
267
- from peft import PeftModel
268
-
269
- # Load Google CodeGemma base model
270
- base_model = "google/codegemma-7b-it"
271
- model = AutoModelForCausalLM.from_pretrained(
272
- base_model,
273
- device_map="auto",
274
- torch_dtype="auto",
275
- trust_remote_code=True
276
- )
277
- tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
278
-
279
- # Load SecureCode LoRA adapter
280
- model = PeftModel.from_pretrained(model, "scthornton/codegemma-7b-securecode")
281
-
282
- # Generate secure code with precise requirements
283
- prompt = """### User:
284
- Generate a secure user registration endpoint in Python Flask with these exact requirements:
285
- 1. Email validation with regex
286
- 2. Password: minimum 12 chars, complexity requirements
287
- 3. Bcrypt hashing (cost factor 12)
288
- 4. Rate limiting: 5 attempts per 15 minutes per IP
289
- 5. CSRF token validation
290
- 6. SQL injection prevention via parameterized queries
291
- 7. Comprehensive audit logging to Stackdriver
292
- 8. Return JSON with proper status codes
293
-
294
- ### Assistant:
295
- """
296
-
297
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
298
- outputs = model.generate(
299
- **inputs,
300
- max_new_tokens=2048,
301
- temperature=0.7,
302
- top_p=0.95,
303
- do_sample=True
304
- )
305
-
306
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
307
- print(response)
308
- ```
309
-
310
- ---
311
-
312
- ### GCP Deployment (Vertex AI)
313
 
314
  ```python
315
- from google.cloud import aiplatform
316
- from transformers import AutoModelForCausalLM
317
  from peft import PeftModel
318
-
319
- # Initialize Vertex AI
320
- aiplatform.init(project='your-project', location='us-central1')
321
-
322
- # Deploy CodeGemma SecureCode to Vertex AI
323
- model = AutoModelForCausalLM.from_pretrained("google/codegemma-7b-it", device_map="auto")
324
- model = PeftModel.from_pretrained(model, "scthornton/codegemma-7b-securecode")
325
-
326
- # Upload to Vertex AI Model Registry
327
- # Deploy as endpoint for production use
328
- # Integrate with Cloud Run, GKE, or other GCP services
329
- ```
330
-
331
- ---
332
-
333
- ### Production Deployment (4-bit Quantization)
334
-
335
- ```python
336
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
337
- from peft import PeftModel
338
 
339
- # 4-bit quantization - runs on 16GB GPU
340
  bnb_config = BitsAndBytesConfig(
341
  load_in_4bit=True,
342
- bnb_4bit_use_double_quant=True,
343
  bnb_4bit_quant_type="nf4",
344
- bnb_4bit_compute_dtype="bfloat16"
345
  )
346
 
347
- model = AutoModelForCausalLM.from_pretrained(
348
  "google/codegemma-7b-it",
349
  quantization_config=bnb_config,
350
  device_map="auto",
351
- trust_remote_code=True
352
  )
 
 
353
 
354
- model = PeftModel.from_pretrained(model, "scthornton/codegemma-7b-securecode")
355
- tokenizer = AutoTokenizer.from_pretrained("google/codegemma-7b-it", trust_remote_code=True)
 
 
356
 
357
- # Production-ready: Runs on RTX 3090, RTX 4080, A5000, or GCP T4
 
 
358
  ```
359
 
360
- ---
361
 
362
- ## 📈 Performance & Benchmarks
363
 
364
- ### Hardware Requirements
365
 
366
- | Deployment | RAM | GPU VRAM | Tokens/Second | Latency (2K response) | Cost/Month |
367
- |-----------|-----|----------|---------------|----------------------|------------|
368
- | **4-bit Quantized** | 16GB | 12GB | ~40 tok/s | ~50 seconds | $0 (local) or $50-100 (cloud) |
369
- | **8-bit Quantized** | 20GB | 16GB | ~50 tok/s | ~40 seconds | $0 (local) or $100-150 (cloud) |
370
- | **Full Precision (bf16)** | 28GB | 20GB | ~65 tok/s | ~31 seconds | $0 (local) or $200-300 (cloud) |
371
- | **GCP Vertex AI** | Managed | Managed | ~60 tok/s | ~33 seconds | $150-250 (pay-per-use) |
372
 
373
- **GCP Integration Winner:** Native Vertex AI deployment with Google's infrastructure optimization.
374
 
375
- ### Real-World Performance
376
 
377
- **Tested on RTX 3090 24GB** (consumer/prosumer GPU):
378
- - **Tokens/second:** ~40 tok/s (4-bit), ~60 tok/s (full precision)
379
- - **Cold start:** ~3 seconds
380
- - **Memory usage:** 10GB (4-bit), 16GB (full precision)
381
- - **Instruction following:** Excellent - implements 95%+ of specified requirements
382
 
383
- **Tested on GCP T4 GPU** (cloud deployment):
384
- - **Tokens/second:** ~35 tok/s (optimized for cost)
385
- - **Auto-scaling:** 0 to 100 instances in <60 seconds
386
- - **Cost efficiency:** $0.35/hour per instance
387
 
388
- ### Code Generation Quality
389
 
390
- **Instruction Following Benchmark:**
391
- - **Requirement compliance:** 95% (implements specified requirements accurately)
392
- - **Security specification adherence:** Excellent
393
- - **Consistency:** High - predictable, reliable outputs
394
 
395
- ---
396
-
397
- ## 💰 Cost Analysis
398
-
399
- ### Total Cost of Ownership (TCO) - 1 Year
400
-
401
- **Option 1: GCP Vertex AI (Recommended for GCP Users)**
402
- - Deployment: Managed Vertex AI endpoint
403
- - Cost: ~$0.50/hour (auto-scaling)
404
- - Usage: 500 hours/month
405
- - **Total Year 1:** $3,000/year
406
-
407
- **Option 2: Self-Hosted (Cloud GPU)**
408
- - GCP n1-highmem-8 + T4 GPU: $0.55/hour
409
- - Usage: 160 hours/month (development team)
410
- - **Total Year 1:** $1,056/year
411
-
412
- **Option 3: Self-Hosted (Local GPU)**
413
- - Hardware: RTX 3090 24GB - $1,000-1,200 (one-time)
414
- - Electricity: ~$60/year
415
- - **Total Year 1:** $1,060-1,260
416
- - **Total Year 2+:** $60/year
417
-
418
- **Option 4: Google Gemini API (for comparison)**
419
- - Cost: Variable pricing
420
- - Typical usage: $1,500-3,000/year for team
421
- - **Total Year 1:** $1,500-3,000/year
422
-
423
- **ROI Winner:** GCP Vertex AI for Google-first orgs (native integration). Local GPU for multi-cloud or cost optimization.
424
-
425
- ---
426
-
427
- ## 🎯 Use Cases & Examples
428
-
429
- ### 1. Secure API Generation with Precise Specifications
430
-
431
- Generate APIs that exactly match security requirements:
432
-
433
- ```python
434
- prompt = """### User:
435
- Create a secure payment processing API endpoint in Node.js/Express with:
436
- - Input validation using Joi
437
- - PCI-DSS compliant data handling
438
- - Stripe integration with webhook verification
439
- - Idempotency key support
440
- - Comprehensive error handling
441
- - Rate limiting (100 req/min)
442
- - Request/response logging to Stackdriver
443
-
444
- ### Assistant:
445
- """
446
- ```
447
-
448
- **Model Response:** Generates complete, production-ready code implementing ALL specified requirements.
449
-
450
- ---
451
-
452
- ### 2. Security Code Review with Structured Output
453
-
454
- Review code with predictable, structured responses:
455
-
456
- ```python
457
- prompt = """### User:
458
- Review this authentication code for OWASP Top 10 vulnerabilities. Provide output in this exact format:
459
- 1. Vulnerability Type
460
- 2. Severity (Critical/High/Medium/Low)
461
- 3. Affected Code Line
462
- 4. Exploitation Scenario
463
- 5. Secure Alternative
464
- 6. OWASP Category
465
-
466
- [Code to review]
467
-
468
- ### Assistant:
469
- """
470
- ```
471
-
472
- **Model Response:** Follows the exact format specified, reliable structured output.
473
-
474
- ---
475
-
476
- ### 3. Educational Content Generation
477
-
478
- Generate consistent educational examples:
479
-
480
- ```python
481
- prompt = """### User:
482
- Create a teaching example showing SQL injection vulnerability and fix. Include:
483
- 1. Vulnerable code with clear comments
484
- 2. Attack demonstration
485
- 3. Secure code with parameterized queries
486
- 4. Explanation suitable for beginners
487
- 5. Practice exercise
488
-
489
- ### Assistant:
490
- """
491
- ```
492
-
493
- **Model Response:** Generates clear, educational content following Google's technical writing standards.
494
-
495
- ---
496
-
497
- ## ⚠️ Limitations & Transparency
498
-
499
- ### What This Model Does Well
500
- ✅ Excellent instruction following for security requirements
501
- ✅ Consistent, predictable responses (Google quality)
502
- ✅ Strong code completion with security awareness
503
- ✅ Reliable implementation of specified security controls
504
- ✅ Clear, well-structured code generation
505
- ✅ Native GCP integration
506
-
507
- ### What This Model Doesn't Do
508
- ❌ **Not a security scanner** - Use tools like Semgrep, CodeQL, or Snyk
509
- ❌ **Not a penetration testing tool** - Cannot perform active exploitation
510
- ❌ **Not legal/compliance advice** - Consult security professionals
511
- ❌ **Not a replacement for security experts** - Critical systems need professional review
512
- ❌ **Not the largest context window** - 8K tokens (vs Qwen's 128K)
513
-
514
- ### Known Characteristics
515
- - **Instruction-focused:** Excels when given clear, structured requirements
516
- - **Consistent outputs:** Highly predictable - good for automation
517
- - **Google ecosystem:** Best performance when deployed on GCP
518
- - **Standard context:** 8K tokens sufficient for most code files
519
-
520
- ### Appropriate Use
521
- ✅ API generation with precise security requirements
522
- ✅ Code completion and IDE integration
523
- ✅ Educational platforms and training
524
- ✅ GCP-based development workflows
525
- ✅ Teams valuing Google engineering culture
526
-
527
- ### Inappropriate Use
528
- ❌ Sole security validation for production systems
529
- ❌ Replacement for professional security audits
530
- ❌ Active penetration testing without authorization
531
- ❌ Very large codebase analysis (use Qwen 14B instead)
532
-
533
- ---
534
-
535
- ## 🔬 Dataset Information
536
-
537
- This model was trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**, a production-grade security dataset with:
538
 
539
- - **1,209 total examples** (841 train / 175 validation / 193 test)
540
- - **100% incident grounding** - every example tied to real CVEs or security breaches
541
- - **11 vulnerability categories** - complete OWASP Top 10:2025 coverage
542
- - **11 programming languages** - from Python to Rust
543
- - **4-turn conversational structure** - mirrors real developer-AI workflows
544
- - **100% expert validation** - reviewed by independent security professionals
545
 
546
- See the [full dataset card](https://huggingface.co/datasets/scthornton/securecode-v2) and [research paper](https://perfecxion.ai/articles/securecode-v2-dataset-paper.html) for complete details.
547
 
548
- ---
549
-
550
- ## 🏢 About perfecXion.ai
551
 
552
- [perfecXion.ai](https://perfecxion.ai) is dedicated to advancing AI security through research, datasets, and production-grade security tooling.
553
 
554
- **Connect:**
555
- - Website: [perfecxion.ai](https://perfecxion.ai)
556
- - Research: [perfecxion.ai/research](https://perfecxion.ai/research)
557
- - Knowledge Hub: [perfecxion.ai/knowledge](https://perfecxion.ai/knowledge)
558
- - GitHub: [@scthornton](https://github.com/scthornton)
559
- - HuggingFace: [@scthornton](https://huggingface.co/scthornton)
560
- - Email: scott@perfecxion.ai
561
 
562
- ---
563
 
564
- ## 📄 License
565
 
566
- **Model License:** Apache 2.0 (permissive - use in commercial applications)
567
- **Dataset License:** CC BY-NC-SA 4.0 (non-commercial with attribution)
568
 
569
- ### What You CAN Do
570
- ✅ Use this model commercially in production applications
571
- ✅ Fine-tune further for your specific use case
572
- ✅ Deploy in enterprise environments
573
- ✅ Integrate into commercial products
574
- ✅ Distribute and modify the model weights
575
- ✅ Charge for services built on this model
576
 
577
- ### What You CANNOT Do with the Dataset
578
- ❌ Sell or redistribute the raw SecureCode v2.0 dataset commercially
579
- ❌ Use the dataset to train commercial models without releasing under the same license
580
- ❌ Remove attribution or claim ownership of the dataset
581
 
582
- For commercial dataset licensing or custom training, contact: scott@perfecxion.ai
583
-
584
- ---
 
 
585
 
586
- ## 📚 Citation
 
 
 
587
 
588
- If you use this model in your research or applications, please cite:
589
 
590
  ```bibtex
591
- @misc{thornton2025securecode-codegemma7b,
592
- title={CodeGemma 7B - SecureCode Edition},
593
  author={Thornton, Scott},
594
- year={2025},
595
  publisher={perfecXion.ai},
596
- url={https://huggingface.co/scthornton/codegemma-7b-securecode},
597
- note={Fine-tuned on SecureCode v2.0: https://huggingface.co/datasets/scthornton/securecode-v2}
598
- }
599
-
600
- @misc{thornton2025securecode-dataset,
601
- title={SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models},
602
- author={Thornton, Scott},
603
- year={2025},
604
- month={January},
605
- publisher={perfecXion.ai},
606
- url={https://perfecxion.ai/articles/securecode-v2-dataset-paper.html},
607
- note={Dataset: https://huggingface.co/datasets/scthornton/securecode-v2}
608
  }
609
  ```
610
 
611
- ---
612
-
613
- ## 🙏 Acknowledgments
614
-
615
- - **Google DeepMind & Google AI** for the excellent CodeGemma base model
616
- - **OWASP Foundation** for maintaining the Top 10 vulnerability taxonomy
617
- - **MITRE Corporation** for the CVE database and vulnerability research
618
- - **Security research community** for responsible disclosure practices
619
- - **Hugging Face** for model hosting and inference infrastructure
620
- - **GCP users** who validated this model in production environments
621
-
622
- ---
623
-
624
- ## 🤝 Contributing
625
-
626
- Found a security issue or have suggestions for improvement?
627
-
628
- - 🐛 **Report issues:** [GitHub Issues](https://github.com/scthornton/securecode-models/issues)
629
- - 💬 **Discuss improvements:** [HuggingFace Discussions](https://huggingface.co/scthornton/codegemma-7b-securecode/discussions)
630
- - 📧 **Contact:** scott@perfecxion.ai
631
-
632
- ### Community Contributions Welcome
633
-
634
- Especially interested in:
635
- - **GCP deployment examples** and Vertex AI integrations
636
- - **Benchmark evaluations** on security datasets
637
- - **Instruction-following assessments** for security tasks
638
- - **Production deployment case studies**
639
- - **Performance optimization** for GCP infrastructure
640
-
641
- ---
642
-
643
- ## 🔗 SecureCode Model Collection
644
-
645
- Explore other SecureCode fine-tuned models optimized for different use cases:
646
-
647
- ### Entry-Level Models (3-7B)
648
- - **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)**
649
- - **Best for:** Consumer hardware, IDE integration, education
650
- - **Hardware:** 8GB RAM minimum
651
- - **Unique strength:** Most accessible
652
-
653
- - **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)**
654
- - **Best for:** Security-optimized baseline
655
- - **Hardware:** 16GB RAM
656
- - **Unique strength:** Security-first architecture
657
-
658
- - **[qwen2.5-coder-7b-securecode](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode)**
659
- - **Best for:** Best code understanding in 7B class
660
- - **Hardware:** 16GB RAM
661
- - **Unique strength:** 128K context, best-in-class
662
-
663
- - **[codegemma-7b-securecode](https://huggingface.co/scthornton/codegemma-7b-securecode)** ⭐ (YOU ARE HERE)
664
- - **Best for:** Google ecosystem, instruction following
665
- - **Hardware:** 16GB RAM
666
- - **Unique strength:** Google quality, GCP integration
667
-
668
- ### Mid-Range Models (13-15B)
669
- - **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)**
670
- - **Best for:** Enterprise trust, Meta brand
671
- - **Hardware:** 24GB RAM
672
- - **Unique strength:** Proven track record
673
-
674
- - **[qwen2.5-coder-14b-securecode](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode)**
675
- - **Best for:** Advanced code analysis
676
- - **Hardware:** 32GB RAM
677
- - **Unique strength:** 128K context window
678
-
679
- - **[starcoder2-15b-securecode](https://huggingface.co/scthornton/starcoder2-15b-securecode)**
680
- - **Best for:** Multi-language projects (600+ languages)
681
- - **Hardware:** 32GB RAM
682
- - **Unique strength:** Broadest language support
683
 
684
- ### Enterprise-Scale Models (20B+)
685
- - **[granite-20b-code-securecode](https://huggingface.co/scthornton/granite-20b-code-securecode)**
686
- - **Best for:** Enterprise-scale, IBM trust
687
- - **Hardware:** 48GB RAM
688
- - **Unique strength:** Largest model, deepest analysis
689
 
690
- **View Complete Collection:** [SecureCode Models](https://huggingface.co/collections/scthornton/securecode)
691
 
692
- ---
693
-
694
- <div align="center">
695
-
696
- **Built with ❤️ for secure software development**
697
-
698
- [perfecXion.ai](https://perfecxion.ai) | [Research](https://perfecxion.ai/research) | [Knowledge Hub](https://perfecxion.ai/knowledge) | [Contact](mailto:scott@perfecxion.ai)
699
-
700
- ---
701
-
702
- *Google quality. Security expertise. Production ready.*
703
-
704
- </div>
 
1
  ---
2
+ license: gemma
3
  base_model: google/codegemma-7b-it
4
  tags:
5
+ - security
6
+ - cybersecurity
7
+ - secure-coding
8
+ - ai-security
9
+ - owasp
10
+ - code-generation
11
+ - qlora
12
+ - lora
13
+ - fine-tuned
14
+ - securecode
15
  datasets:
16
+ - scthornton/securecode
17
+ library_name: peft
 
 
18
  pipeline_tag: text-generation
19
+ language:
20
+ - code
21
+ - en
22
  ---
23
 
24
+ # CodeGemma 7B SecureCode
25
 
26
  <div align="center">
27
 
28
+ ![Parameters](https://img.shields.io/badge/params-7B-blue.svg)
29
+ ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
30
+ ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
31
+ ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
 
 
32
 
33
+ **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
34
 
35
+ [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
36
 
37
  </div>
38
 
39
  ---
40
 
41
+ ## What This Model Does
42
 
43
+ This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (as AI coding assistants do in roughly 45% of security-relevant scenarios, per Veracode 2025), it:
44
 
45
+ - Identifies the security risks in common coding patterns
46
+ - Provides vulnerable *and* secure implementations side by side
47
+ - Explains how attackers would exploit the vulnerability
48
+ - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
 
49
 
50
+ The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
 
 
51
 
52
+ ## Model Details
53
 
54
+ | | |
55
+ |---|---|
56
+ | **Base Model** | [CodeGemma 7B IT](https://huggingface.co/google/codegemma-7b-it) |
57
+ | **Parameters** | 7B |
58
+ | **Architecture** | Gemma |
59
+ | **Tier** | Tier 2: Mid-size Code Specialist |
60
+ | **Method** | QLoRA (4-bit NormalFloat quantization) |
61
+ | **LoRA Rank** | 16 (alpha=32) |
62
+ | **Target Modules** | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (7 modules) |
63
+ | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
64
+ | **Hardware** | NVIDIA A100 40GB |
65
 
66
+ CodeGemma is Google's code-specialized Gemma variant, combining strong instruction following with an efficient architecture.
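
The adapter settings in the table above can be written as a `peft` configuration. This is a minimal sketch rather than the author's training script: the rank, alpha, and target modules come from the table, the dropout value comes from the Hyperparameters table, and `bias` and `task_type` are assumptions that this card does not state.

```python
from peft import LoraConfig

# Rank, alpha, dropout, and target modules are the values listed in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    bias="none",            # assumption: not stated in this card
    task_type="CAUSAL_LM",  # assumption: usual setting for causal LM fine-tuning
)
```

Targeting all seven linear projections adapts both the attention and MLP blocks, which is why the table counts 7 modules.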
67
 
68
+ ## Quick Start
69
 
70
  ```python
 
 
71
  from peft import PeftModel
72
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
73
+ import torch
74
 
75
+ # Load with 4-bit quantization (matches training)
76
  bnb_config = BitsAndBytesConfig(
77
  load_in_4bit=True,
 
78
  bnb_4bit_quant_type="nf4",
79
+ bnb_4bit_compute_dtype=torch.bfloat16,
80
  )
81
 
82
+ base_model = AutoModelForCausalLM.from_pretrained(
83
  "google/codegemma-7b-it",
84
  quantization_config=bnb_config,
85
  device_map="auto",
 
86
  )
87
+ tokenizer = AutoTokenizer.from_pretrained("scthornton/codegemma-7b-securecode")
88
+ model = PeftModel.from_pretrained(base_model, "scthornton/codegemma-7b-securecode")
89
 
90
+ # Ask a security-relevant coding question
91
+ messages = [
92
+ {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
93
+ ]
94
 
95
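+ # add_generation_prompt=True appends the assistant turn marker so the model answers the question
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
96
+ # do_sample=True is required for temperature to take effect
+ outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)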
97
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
98
  ```
99
 
100
+ ## Training Details
101
 
102
+ ### Dataset
103
 
104
+ Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
105
 
106
+ - **2,185 total examples** (1,435 web security + 750 AI/ML security)
107
+ - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
108
+ - **12+ programming languages** and **49+ frameworks**
109
+ - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
110
+ - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
 
111
 
112
+ ### Hyperparameters
113
 
114
+ | Parameter | Value |
115
+ |-----------|-------|
116
+ | LoRA rank | 16 |
117
+ | LoRA alpha | 32 |
118
+ | LoRA dropout | 0.05 |
119
+ | Target modules | 7 linear layers |
120
+ | Quantization | 4-bit NormalFloat (NF4) |
121
+ | Learning rate | 2e-4 |
122
+ | LR scheduler | Cosine with 100-step warmup |
123
+ | Epochs | 3 |
124
+ | Per-device batch size | 2 |
125
+ | Gradient accumulation | 8x |
126
+ | Effective batch size | 16 |
127
+ | Max sequence length | 4096 tokens |
128
+ | Optimizer | paged_adamw_8bit |
129
+ | Precision | bf16 |
130
 
131
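+ **Notes:** Gemma is supported natively by `transformers`, so `trust_remote_code=True` should not be needed; the Quick Start loads the model without it. Training used an extended 4096-token context to fit full security conversations.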
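
The schedule implied by the Hyperparameters table can be worked out with a few lines of arithmetic. The sketch below assumes a single training device and that all 2,185 examples were used for training with no held-out split; neither is stated in this card, and the exact step count depends on how the trainer rounds the final partial batch.

```python
import math

# Figures taken from the Dataset and Hyperparameters sections
num_examples = 2185
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_epochs = 3
warmup_steps = 100
lora_rank = 16
lora_alpha = 32

# Assumption: one training device and no held-out evaluation split
num_devices = 1

# Examples consumed per optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Optimizer updates per epoch, counting the final partial batch as one step
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Share of training spent in learning-rate warmup
warmup_fraction = warmup_steps / total_steps

# Standard LoRA scales the adapter update by alpha / rank
lora_scaling = lora_alpha / lora_rank

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"warmup fraction:      {warmup_fraction:.1%}")
print(f"LoRA scaling factor:  {lora_scaling}")
```

Under these assumptions the run is about 137 optimizer steps per epoch and 411 in total, so the 100-step warmup covers roughly a quarter of training before the cosine decay begins.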
 
 
 
 
132
 
133
+ ## Security Coverage
 
 
 
134
 
135
+ ### Web Security (1,435 examples)
136
 
137
+ OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
 
 
 
138
 
139
+ Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
140
 
141
+ ### AI/ML Security (750 examples)
 
 
 
 
 
142
 
143
+ OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
144
 
145
+ Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
 
 
146
 
147
+ ## SecureCode Model Collection
148
 
149
+ This model is part of the **SecureCode** collection of 8 security-specialized models:
 
 
 
 
 
 
150
 
151
+ | Model | Base | Size | Tier | HuggingFace |
152
+ |-------|------|------|------|-------------|
153
+ | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
154
+ | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
155
+ | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
156
+ | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
157
+ | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
158
+ | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
159
+ | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
160
+ | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
161
 
162
+ Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
163
 
164
+ ## SecureCode Dataset Family
 
165
 
166
+ | Dataset | Examples | Focus | Link |
167
+ |---------|----------|-------|------|
168
+ | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
169
+ | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
170
+ | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
 
 
171
 
172
+ ## Intended Use
 
 
 
173
 
174
+ **Use this model for:**
175
+ - Training AI coding assistants to write secure code
176
+ - Security education and training
177
+ - Vulnerability research and secure code review
178
+ - Building security-aware development tools
179
 
180
+ **Do not use this model for:**
181
+ - Offensive exploitation or automated attack generation
182
+ - Circumventing security controls
183
+ - Any activity that violates the base model's license
184
 
185
+ ## Citation
186
 
187
  ```bibtex
188
+ @misc{thornton2026securecode,
189
+ title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
190
  author={Thornton, Scott},
191
+ year={2026},
192
  publisher={perfecXion.ai},
193
+ url={https://huggingface.co/datasets/scthornton/securecode},
194
+ note={arXiv:2512.18542}
195
  }
196
  ```
197
 
198
+ ## Links
199
 
200
+ - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
201
+ - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
202
+ - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
203
+ - **Author**: [perfecXion.ai](https://perfecxion.ai)
 
204
 
205
+ ## License
206
 
207
+ This model is released under the **gemma** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.