Training a SQL Database Engineer Agent with GRPO on Qwen2.5
Fine-tuning a language model to autonomously diagnose and fix slow database queries using Reinforcement Learning
Overview
Modern applications live and die by their database performance. Slow queries cause timeouts, poor user experience, and higher infrastructure costs, yet diagnosing and fixing them requires deep expertise. What if a language model could learn to do this autonomously?
In this project, we trained Qwen2.5-7B-Instruct to act as a senior database engineer (inspecting slow queries, identifying missing indexes, and applying targeted fixes) using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that teaches the model through reward signals rather than labeled examples.
After 200 training steps, the agent achieved a +94% reward improvement (0.235 → 0.456) and outperformed a random baseline by an average of +31.4 database performance points across 15 scenarios.
The Problem
Given a database with slow-running SQL queries, the agent must:
- Investigate: understand why queries are slow
- Diagnose: identify missing indexes or inefficient query patterns
- Fix: apply the correct indexes and optimizations
- Verify: confirm the performance score improved
A random agent that creates indexes on arbitrary columns scores 0 pts on every scenario. Our trained agent had to learn, purely from feedback, which tables and columns actually matter.
Architecture
Environment: DatabaseSimulator
We built a custom DatabaseSimulator that:
- Loads SQL scenarios (tables, slow queries, missing index hints)
- Tracks a performance score (0–100) that improves when correct indexes are applied
- Returns delta rewards based on how much the score improved
- Runs locally: no HTTP calls, no shared state, fully deterministic
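To make the environment concrete, here is a minimal sketch of what such a simulator could look like. The class layout, method names, and the `impact` field are illustrative assumptions, not the project's exact implementation.

```python
import json
from dataclasses import dataclass, field

# Hypothetical sketch of the simulator interface described above.
@dataclass
class DatabaseSimulator:
    scenario: dict                      # tables, slow queries, missing-index hints
    score: float = 0.0                  # performance score, 0-100
    created_indexes: list = field(default_factory=list)

    @classmethod
    def from_file(cls, path: str) -> "DatabaseSimulator":
        # Load a scenario definition (tables, queries, hints) from JSON.
        with open(path) as f:
            return cls(scenario=json.load(f))

    def apply_index(self, table: str, columns: list[str]) -> float:
        # Improve the score only if the index matches a missing-index hint,
        # and return the *delta* so the reward reflects incremental progress.
        before = self.score
        for hint in self.scenario.get("missing_indexes", []):
            if hint["table"] == table and set(hint["columns"]) == set(columns):
                self.score = min(100.0, self.score + hint.get("impact", 25.0))
        self.created_indexes.append((table, tuple(columns)))
        return self.score - before
```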
Scenarios
We created 15 scenarios across 3 difficulty levels:
| Level | Count | Description |
|---|---|---|
| Easy | 5 | Single table, one missing index |
| Medium | 5 | E-commerce DB, composite indexes |
| Hard | 5 | 4-table financial schema, complex joins |
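For illustration, an easy scenario might be encoded roughly like this; the field names and values are assumptions based on the description above, matching the simulator sketch earlier.

```python
# Hypothetical easy scenario: one table, one missing index.
easy_scenario = {
    "id": "easy_1",
    "difficulty": "easy",
    "tables": {
        "users": ["id", "email", "created_at", "country"],
    },
    "slow_queries": {
        "q1": "SELECT * FROM users WHERE email = 'a@example.com';",
    },
    # The column(s) whose index would fix the slow query.
    "missing_indexes": [
        {"table": "users", "columns": ["email"], "impact": 60.0},
    ],
}
```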
Action Space
The agent can choose from 10 action types, including:
{"action_type": "inspect_query", "payload": {"query_id": "q1"}}
{"action_type": "analyze_indexes", "payload": {}}
{"action_type": "create_index", "payload": {"table": "orders", "columns": ["user_id", "status"]}}
{"action_type": "rewrite_query", "payload": {"query_id": "q1", "new_sql": "..."}}
{"action_type": "analyze_statistics","payload": {"table": "orders"}}
{"action_type": "submit_report", "payload": {"summary": "..."}}
Training Setup
Model
- Base model: `unsloth/Qwen2.5-7B-Instruct` (7.66B parameters)
- Trainable parameters: 40,370,176 of 7,655,986,688 (0.53% via LoRA)
- Fine-tuning: LoRA (r=16, alpha=16) via Unsloth (2x faster fine-tuning)
- Training algorithm: GRPO (Group Relative Policy Optimization)
- Framework: TRL + Unsloth + PyTorch
- GPU: Single GPU (1x)
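For reference, loading the base model and attaching the LoRA adapters with Unsloth looks roughly like this; the 4-bit loading, sequence length, and target modules below are assumptions rather than the project's exact settings.

```python
from unsloth import FastLanguageModel

# Load the base model and attach LoRA adapters (r=16, alpha=16).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=1024,   # assumed value
    load_in_4bit=True,     # assumed; keeps a 7B model within a single GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module list
    lora_dropout=0,
    bias="none",
)
```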
Training Data
- Examples: 15 scenarios
- Epochs: 29 (cycling through all 15 scenarios)
- Total steps: 200
- Effective batch size: 8 (batch size 4 × gradient accumulation 2 × 1 GPU)
GRPO Reward Function
The reward function combines three signals:
total_reward = step_reward + delta_reward + milestone_bonus
| Component | Description | Range |
|---|---|---|
| `step_reward` | Base reward per valid action type | 0.05–0.20 |
| `delta_reward` | Proportional to DB performance improvement | 0.0–0.65 |
| `milestone_bonus` | Bonus at 25%, 50%, 75% improvement thresholds | 0.15–0.40 |
| `wrong_index_penalty` | Penalty for indexing useless columns | -0.05 |
Expected rewards per action:
```
inspect_query / analyze_indexes     → ~0.10
create_index (no table/col match)   → ~0.10
create_index (partial hint match)   → ~0.20–0.45
create_index (perfect hint match)   → ~0.55–0.80
create_index (simulator confirms)   → ~0.75–0.99
Milestones: 25% = +0.15, 50% = +0.25, 75% = +0.40 (cumulative)
```
Key design decision: We used a hint-match fallback to give GRPO a gradient signal early in training. Before the model has learned exact column names, partial column matches still receive proportional rewards. This prevented the cold-start problem where the model gets 0 reward for everything and never improves.
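A condensed sketch of how these signals could be combined, including the hint-match fallback and the wrong-index penalty, is shown below. The constants, field names, and the `DatabaseSimulator` interface follow the illustrative simulator sketch above, not the exact implementation.

```python
def compute_reward(action: dict, sim: "DatabaseSimulator", milestones: set) -> float:
    """Illustrative combination of step, delta, and milestone rewards."""
    if action is None:
        return 0.0  # unparseable completion: no reward

    reward = 0.10  # step_reward: base credit for any valid action type

    if action["action_type"] == "create_index":
        payload = action.get("payload", {})
        delta = sim.apply_index(payload.get("table", ""), payload.get("columns", []))
        if delta > 0:
            # delta_reward: proportional to the DB performance improvement
            reward += 0.65 * (delta / 100.0)
        else:
            # hint-match fallback: partial credit for naming hinted columns,
            # even when the simulator does not confirm an improvement
            hinted = {c for h in sim.scenario.get("missing_indexes", []) for c in h["columns"]}
            overlap = hinted & set(payload.get("columns", []))
            if overlap:
                reward += 0.10 * len(overlap)
            else:
                reward -= 0.05  # wrong_index_penalty for useless columns

        # milestone_bonus at 25% / 50% / 75% of the performance score
        for threshold, bonus in [(25, 0.15), (50, 0.25), (75, 0.40)]:
            if sim.score >= threshold and threshold not in milestones:
                milestones.add(threshold)
                reward += bonus
    return reward
```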
Training Config
```python
GRPOConfig(
    max_steps = 200,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-5,
    max_completion_length = 150,
    num_generations = 4,
    temperature = 1.0,
    warmup_steps = 10,
)
```
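Putting the pieces together with TRL could look like the following. The prompt construction and the `score_completion` helper are hypothetical placeholders for the project's own scenario prompts and reward function.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative wiring: one prompt per scenario, scored by a reward function
# like the sketch above.
train_dataset = Dataset.from_list(
    [{"prompt": f"Scenario {i}: diagnose and fix the slow queries."} for i in range(15)]
)

def scenario_reward(completions, **kwargs):
    # TRL calls the reward function with the generated completions and expects
    # one float per completion; score_completion is a hypothetical helper.
    return [score_completion(c) for c in completions]

trainer = GRPOTrainer(
    model=model,                  # the LoRA-wrapped model from the setup above
    reward_funcs=scenario_reward,
    args=GRPOConfig(max_steps=200, num_generations=4, learning_rate=2e-5,
                    per_device_train_batch_size=4, gradient_accumulation_steps=2),
    train_dataset=train_dataset,
)
trainer.train()
```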
Results
Training Curves
After 200 steps of GRPO training:
- Loss: `4.92e-07 → 1.23e-05` (GRPO policy loss rises as the model becomes more confident in its policy; this is expected behaviour in GRPO, not divergence. The 10-step rolling average confirms stable learning without collapse.)
- Reward: `0.235 → 0.456` (+94% improvement). The reward shows a strong and consistent upward trend from ~0.20 to ~0.45, with the 10-step rolling average clearly confirming the model improved throughout training.
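The 10-step rolling averages mentioned above can be recomputed from the trainer's log history along these lines; this is a sketch that assumes a `reward` field in the logged metrics.

```python
import numpy as np

# Pull per-step rewards from the trainer's log history and smooth them
# with a 10-step moving average (assumes a "reward" key in the log entries).
rewards = [entry["reward"] for entry in trainer.state.log_history if "reward" in entry]
rolling = np.convolve(rewards, np.ones(10) / 10, mode="valid")
print(f"first: {rewards[0]:.3f}  last: {rewards[-1]:.3f}  smoothed max: {rolling.max():.3f}")
```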
Evaluation: Trained vs. Random Agent
We evaluated both agents across all 15 scenarios:
| Agent | Avg Improvement | Best Scenario | Worst Scenario |
|---|---|---|---|
| Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
| Trained (GRPO) | +31.4 pts | +59 pts (Scenario 8) | +10 pts |
The trained agent outperformed the random baseline on every single scenario, with an average improvement of +31.4 database performance points. Scenario 8 was flagged as a statistical outlier (more than 1.5σ above the mean): the agent found an especially impactful index combination. The relative gain is effectively infinite, since the untrained baseline scored exactly 0 on all scenarios.
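A minimal version of the evaluation loop might look like this, reusing the illustrative `DatabaseSimulator` from earlier; the `agent.act` interface is a hypothetical stand-in for however the trained and random agents produce actions.

```python
def evaluate(agent, scenarios, max_steps=6):
    """Run an agent on each scenario and return the average score improvement."""
    improvements = []
    for scenario in scenarios:
        sim = DatabaseSimulator(scenario=scenario)
        start = sim.score
        for _ in range(max_steps):
            action = agent.act(sim)  # agent.act is a hypothetical interface
            if action["action_type"] == "create_index":
                sim.apply_index(**action["payload"])
            elif action["action_type"] == "submit_report":
                break
        improvements.append(sim.score - start)
    return sum(improvements) / len(improvements)
```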
Key Learnings
1. Reward shaping is everything in GRPO
The model started producing low-reward outputs for the first ~10 steps until the hint-match fallback kicked in. Without partial credit for close-but-not-perfect column names, training would have stalled completely.
2. LoRA makes 7B models trainable on a single GPU
With only 0.53% of parameters trainable via LoRA, we fine-tuned a full 7B model on a single GPU in under 2 hours. Without LoRA this would require multiple A100s.
3. Local simulation beats API calls for training
Using DatabaseSimulator directly (instead of calling a REST API) made rewards deterministic, removed shared state bugs, and made training 10x faster with no network latency.
4. GRPO loss behaviour differs from supervised loss
Unlike cross-entropy loss in supervised fine-tuning, GRPO policy loss can increase as the model becomes more confident in its policy. This is normal and does not indicate a problem β what matters is whether the reward is trending upward.
5. Composite indexes are hard to learn
The model consistently struggled with scenarios requiring composite indexes on 3+ columns. Single-column indexes were learned quickly (by step ~20), while multi-column patterns took much longer to emerge.
Live Demo
Try the agent yourself: pick a scenario difficulty, choose between the trained GRPO agent and the rule-based baseline, and watch it diagnose and fix the database in real time:
SQL Database Engineer Agent β Live Demo
Resources
| Resource | Link |
|---|---|
| Demo Space | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
| Source code (GitHub) | https://github.com/Mdjunaid06/sql-db-engineer-agent |
| Source code (Hugging Face) | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent |
| Training notebook | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
| Google Colab | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
What's Next
- More steps: 200 steps showed strong learning; 500+ steps would likely push the average score above 50 pts
- Harder scenarios: 8-table schemas with nested subqueries and CTEs
- Query rewriting: The agent currently focuses on indexing; teaching it to rewrite SQL itself is the next frontier
- Multi-step episodes: Chain multiple actions per episode so the agent can inspect → diagnose → fix → verify in sequence
Acknowledgements
Built for the META × PyTorch × SST Hackathon using:
- Unsloth: 2x faster LoRA fine-tuning
- TRL: GRPO implementation
- Hugging Face: model hosting and Spaces
- Qwen2.5: base language model