# CommitGuard: Comprehensive GCP Deployment & Training Guide (A10G)
This document is a deep-dive, step-by-step manual for deploying the CommitGuard environment and training pipeline to a Google Cloud Platform (GCP) instance. We are targeting an NVIDIA A10G GPU to execute GRPO (Group Relative Policy Optimization) on the Llama-3.2-3B model.
## 1. Prerequisites: Setting Up Your Toolbox
Before you touch the cloud, you must ensure your local environment and external accounts are configured. These are the building blocks of the entire run.
### A. GCP Account & Project Setup
- Active Project: You must have a GCP project created. Note your `PROJECT_ID`.
- GPU Quota: By default, GCP projects have a GPU quota of 0. Navigate to `IAM & Admin > Quotas` and request a limit increase for `NVIDIA_A10G_GPUS` in your desired region (e.g., `us-central1`). Do this at least 24 hours in advance, since quota requests can take a day to approve.
### B. Weights & Biases (WandB) for Visualization
- Why? RL training can be unstable. WandB allows you to monitor the "Reward" and "KL Divergence" curves in real-time from your browser.
- Action: Create a free account at wandb.ai, navigate to your settings, and copy your API Key.
### C. Hugging Face Account & Llama Access
- Model Gating: Llama-3.2-3B is a gated model. You must visit the model page and apply for access. Approval usually takes 30-60 minutes.
- Access Token: Generate a "Write" token in your Hugging Face settings to allow the VM to download the model and upload your finished adapters.
### D. Local gcloud CLI Initialization
- Installation: Install the Google Cloud SDK on your laptop.
- Authentication: Run `gcloud auth login` and `gcloud config set project [YOUR_PROJECT_ID]`. This lets your local terminal talk to GCP.
## Step 1: Provisioning the High-Performance VM
We are using the `g2-standard-4` machine type, which is specifically designed for AI workloads.
### Detailed Breakdown of the Creation Command
- `--machine-type=g2-standard-4`: Provides 4 vCPUs and 16 GB of system RAM, ensuring the CPU doesn't bottleneck the GPU.
- `--accelerator=count=1,type=nvidia-a10g`: Attaches the A10G GPU. Its 24 GB of VRAM is the "Goldilocks" zone for 3B-parameter models: enough to hold the model plus the multiple "generations" required by the GRPO algorithm.
- `--image-family=common-cu121`: Uses a specialized Google image that comes with CUDA 12.1 and the NVIDIA drivers pre-installed. This saves you ~30 minutes of manual driver installation.
- `--provisioning-model=SPOT`: CRITICAL FOR BUDGET. Spot instances use excess capacity and are roughly 70% cheaper than standard instances. If the instance is reclaimed by Google, your 50-step checkpoints ensure you don't lose much progress.
```bash
gcloud compute instances create commitguard-trainer \
  --project=[PROJECT_ID] \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-a10g \
  --image-project=ml-images \
  --image-family=common-cu121 \
  --boot-disk-size=100GB \
  --boot-disk-type=pd-balanced \
  --maintenance-policy=TERMINATE \
  --provisioning-model=SPOT
```
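To sanity-check the "Goldilocks" VRAM claim above, here is back-of-the-envelope arithmetic for the memory budget. Every figure below is an illustrative assumption, not a measured value:

```python
# Rough VRAM budget for 4-bit GRPO training of a ~3B model on a 24 GB A10G.
# All numbers are illustrative assumptions, not measurements.
params = 3.2e9                       # approx. parameter count of Llama-3.2-3B
weights_gb = params * 0.5 / 1e9      # 4-bit weights ~= 0.5 bytes per parameter
lora_gb = 5.5e6 * (4 + 8) / 1e9      # ~5.5M LoRA params: fp32 weights + Adam moments
activations_gb = 8.0                 # guess: KV cache + activations for 4 generations
total_gb = weights_gb + lora_gb + activations_gb
print(f"~{total_gb:.1f} GB of 24 GB")
```

Even with a generous activation estimate, the 4-bit model leaves comfortable headroom on 24 GB; that headroom shrinks quickly for larger models.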
## Step 2: Environment Preparation
Once the VM is "Running," we need to turn it into a specialized CommitGuard lab.
### A. Secure Connection (SSH)
Connect to the machine's terminal:
```bash
gcloud compute ssh commitguard-trainer --zone=us-central1-a
```
### B. Repository & Virtual Environment
We isolate our dependencies to prevent conflicts with system-level Python packages.
```bash
# Clone the project
git clone https://github.com/[YOUR_USER]/commitguard.git
cd commitguard

# Create a 'venv' (Virtual Environment)
python3 -m venv .venv
source .venv/bin/activate

# Authenticate with Hugging Face (required for gated Llama models)
huggingface-cli login
```
### C. Installing the "Train" Stack
The `pip install -e ".[train]"` command installs the `commitguard` package in "editable" mode along with all optional training libraries such as `torch`, `peft`, and `trl`.
```bash
pip install -U pip
pip install -e ".[train]"

# Flash Attention 2: a specialized kernel that makes Llama training
# significantly faster and more memory-efficient on A10G hardware.
pip install flash-attn --no-build-isolation
```
## Step 3: Launching the Verifiable Reward Server
CommitGuard uses RLVR (Reinforcement Learning with Verifiable Rewards). In this setup, the model doesn't just "guess" whether it is right; it submits an action to a server that calculates a reward based on hard evidence.
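As an illustration of what a verifiable reward can look like, it might combine a format check with a correctness check against a known label. This is a hypothetical sketch, not CommitGuard's actual server code; the `<analysis>`/`<verdict>` tags and the -1.0/0.0/1.0 reward scale are assumptions:

```python
import re

# Hypothetical verifiable-reward shape: format gate first, then correctness.
VERDICT_RE = re.compile(
    r"<analysis>.+?</analysis>\s*<verdict>(safe|vulnerable)</verdict>",
    re.DOTALL,
)

def reward(completion: str, ground_truth: str) -> float:
    """Return -1.0 for malformed output, 1.0 for a correct verdict, else 0.0."""
    match = VERDICT_RE.search(completion)
    if match is None:
        return -1.0          # output did not follow the required XML format
    return 1.0 if match.group(1) == ground_truth else 0.0

good = "<analysis>eval() runs attacker-controlled input</analysis><verdict>vulnerable</verdict>"
```

Because the verdict is checked against evidence rather than model self-assessment, the reward signal cannot be gamed by confident-sounding prose alone.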
### Running in the Background
Since training takes hours, we run the server in the background with the `&` operator (for extra safety against SSH disconnects, consider `nohup` or a `tmux` session).
```bash
# Start the server
python -m commitguard_env.server &

# Verify health: this ensures the database and API are ready.
# If this fails, the trainer will hang indefinitely.
curl http://localhost:8000/health
# You should see: {"status":"healthy"}
```
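Because a dead server makes the trainer hang, it can be worth polling `/health` with a timeout before launching the run. A minimal stdlib sketch (the URL and retry budget are assumptions):

```python
import json
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """Return True only for the expected {"status": "healthy"} payload."""
    try:
        return json.loads(body).get("status") == "healthy"
    except (ValueError, AttributeError):
        return False         # not JSON, or JSON that isn't an object

def wait_for_server(url: str = "http://localhost:8000/health",
                    retries: int = 12, delay: float = 5.0) -> bool:
    """Poll the reward server instead of letting the trainer hang forever."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except OSError:
            pass             # server not up yet; retry after a short sleep
        time.sleep(delay)
    return False
```

Calling `wait_for_server()` right before the training script starts turns a silent hang into an explicit, debuggable failure.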
## Step 4: Executing the GRPO Training Run
GRPO is a reinforcement learning algorithm. It asks the model to generate 4 different answers for the same code diff, compares them to each other, and rewards the ones that follow the XML format and correctly identify the vulnerability.
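The "compares them to each other" step can be sketched numerically. In GRPO, each completion's advantage is its reward relative to the group's mean, normalized by the group's standard deviation; this toy version assumes that standard normalization:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each completion against its own group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # all-equal rewards would give sigma = 0
    return [(r - mu) / sigma for r in rewards]

# Four generations for the same diff: two well-formatted, two malformed.
advantages = group_relative_advantages([1.0, 1.0, 0.0, 0.0])
# Above-average completions get positive advantage (reinforced);
# below-average completions get negative advantage (penalized).
```

Note that if all four generations earn identical rewards, every advantage is zero and the batch teaches the model nothing, which is why reward shaping matters.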
### Hyperparameter Explanation
- `--steps 500`: The model will see roughly 2,000 examples (4 generations x 500 steps).
- 4-bit Quantization: Handled automatically by the script. It compresses the model weights so they fit into the GPU's memory with minimal accuracy loss.
- LoRA r=8: "Low-Rank Adaptation." Instead of training 3 billion parameters, we train only about 5 million. This makes training stable and fast.
- `--live`: Tells the script to fetch rewards from the server we started in Step 3.
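The "about 5 million" figure is easy to reproduce. This sketch assumes the published Llama-3.2-3B shape (hidden size 3072, 28 layers) and, for simplicity, square q/k/v/o attention projections with LoRA applied to all four:

```python
# Back-of-the-envelope LoRA parameter count (illustrative assumptions).
hidden, layers, r = 3072, 28, 8
# LoRA adds two low-rank factors per adapted projection:
# A (r x d_in) and B (d_out x r).
per_projection = 2 * r * hidden            # square projection assumed
lora_params = layers * 4 * per_projection  # q, k, v, o projections per layer
print(f"{lora_params / 1e6:.1f}M trainable vs ~3,200M frozen")
```

The real count is somewhat lower because the 3B model's k/v projections are smaller than its q/o projections, but the order of magnitude, roughly 0.2% of the full model, is the point.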
```bash
# Log in to WandB so your graphs show up online
export WANDB_API_KEY=[YOUR_WANDB_KEY]

python scripts/train_grpo.py \
  --model_name "meta-llama/Llama-3.2-3B-Instruct" \
  --output_dir "./outputs/commitguard-final" \
  --steps 500 \
  --live \
  --wandb "commitguard-rlvr"
```
## Step 5: Post-Run Weight Management & Cleanup
Once the 500 steps are complete, the "brain" of your agent exists as a LoRA adapter in the ./outputs folder.
### A. Permanent Storage (Hugging Face)
The VM's disk is temporary. Move your weights to Hugging Face immediately.
```bash
huggingface-cli login --token [YOUR_HF_TOKEN]
huggingface-cli upload [HF_USERNAME]/commitguard-llama3b-adapter ./outputs/commitguard-final
```
### B. Cost Control: Deleting the VM
DO NOT FORGET THIS STEP. An idle A10G instance costs money every hour.
```bash
# Exit the VM
exit

# Delete from your local terminal
gcloud compute instances delete commitguard-trainer --zone=us-central1-a
```
## Critical Troubleshooting
"CUDA Out of Memory"
- Symptom: Training crashes with a long error ending in `OutOfMemoryError`.
- Fix: The "group" in GRPO is currently set to 4 generations. Open `scripts/train_grpo.py` and change `num_generations=4` to `num_generations=2`. This roughly halves the memory consumed by generation.
"Connection Refused"
- Symptom: The reward function returns -1.0 for everything or throws errors.
- Fix: Your environment server crashed or was never started. Run `ps aux | grep server` to check whether it is still running, and restart it if not.
### The "Midnight Fallback"
If the 3B model is too slow for the submission deadline:
- Switch to the 1.5B Qwen model. It uses the same XML format but trains roughly 2x faster.
- Command:

```bash
python scripts/train_grpo.py --model_name "Qwen/Qwen2.5-1.5B-Instruct" ...
```
## Final Success Checklist
- Health Check: `curl` returns healthy.
- WandB Tracking: You can see the `reward` curve moving on the website.
- Checkpoints: You see folders like `checkpoint-50`, `checkpoint-100` in the output directory.
- Clean Exit: The VM is deleted after the adapter is uploaded to Hugging Face.