sql-db-engineer-agent / blog_post.md
junaid0600's picture
Update blog_post.md
ac0ea5f verified

Training a SQL Database Engineer Agent with GRPO on Qwen2.5

Fine-tuning a language model to autonomously diagnose and fix slow database queries using Reinforcement Learning


Overview

Modern applications live and die by their database performance. Slow queries cause timeouts, poor user experience, and infrastructure costs β€” yet diagnosing and fixing them requires deep expertise. What if a language model could learn to do this autonomously?

In this project, we trained Qwen2.5-7B-Instruct to act as a senior database engineer β€” inspecting slow queries, identifying missing indexes, and applying targeted fixes β€” using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that teaches the model through reward signals rather than labeled examples.

After 200 training steps, the agent achieved a +94% reward improvement (0.235 β†’ 0.456) and outperformed a random baseline by an average of +31.4 database performance points across 15 scenarios.


The Problem

Given a database with slow-running SQL queries, the agent must:

  1. Investigate β€” understand why queries are slow
  2. Diagnose β€” identify missing indexes or inefficient query patterns
  3. Fix β€” apply the correct indexes and optimizations
  4. Verify β€” confirm the performance score improved

A random agent that creates indexes on arbitrary columns scores 0 pts on every scenario. Our trained agent had to learn β€” purely from feedback β€” which tables and columns actually matter.


Architecture

Environment β€” DatabaseSimulator

We built a custom DatabaseSimulator that:

  • Loads SQL scenarios (tables, slow queries, missing index hints)
  • Tracks a performance score (0–100) that improves when correct indexes are applied
  • Returns delta rewards based on how much the score improved
  • Runs locally β€” no HTTP calls, no shared state, fully deterministic

Scenarios

We created 15 scenarios across 3 difficulty levels:

Level Count Description
Easy 5 Single table, one missing index
Medium 5 E-commerce DB, composite indexes
Hard 5 4-table financial schema, complex joins

Action Space

The agent can take 10 actions:

{"action_type": "inspect_query",     "payload": {"query_id": "q1"}}
{"action_type": "analyze_indexes",   "payload": {}}
{"action_type": "create_index",      "payload": {"table": "orders", "columns": ["user_id", "status"]}}
{"action_type": "rewrite_query",     "payload": {"query_id": "q1", "new_sql": "..."}}
{"action_type": "analyze_statistics","payload": {"table": "orders"}}
{"action_type": "submit_report",     "payload": {"summary": "..."}}

Training Setup

Model

  • Base model: unsloth/Qwen2.5-7B-Instruct (7.66B parameters)
  • Trainable parameters: 40,370,176 of 7,655,986,688 (only 0.53% via LoRA)
  • Fine-tuning: LoRA (r=16, alpha=16) via Unsloth β€” 2x faster free finetuning
  • Training algorithm: GRPO (Group Relative Policy Optimization)
  • Framework: TRL + Unsloth + PyTorch
  • GPU: Single GPU (1x)

Training Data

  • Examples: 15 scenarios
  • Epochs: 29 (cycling through all 15 scenarios)
  • Total steps: 200
  • Effective batch size: 8 (batch size 4 Γ— gradient accumulation 2 Γ— 1 GPU)

GRPO Reward Function

The reward function combines three signals:

total_reward = step_reward + delta_reward + milestone_bonus
Component Description Range
step_reward Base reward per valid action type 0.05–0.20
delta_reward Proportional to DB performance improvement 0.0–0.65
milestone_bonus Bonus at 25%, 50%, 75% improvement thresholds 0.15–0.40
wrong_index_penalty Penalty for indexing useless columns -0.05

Expected rewards per action:

inspect_query / analyze_indexes       β†’  ~0.10
create_index (no table/col match)     β†’  ~0.10
create_index (partial hint match)     β†’  ~0.20–0.45
create_index (perfect hint match)     β†’  ~0.55–0.80
create_index (simulator confirms)     β†’  ~0.75–0.99
Milestones: 25%=+0.15  50%=+0.25  75%=+0.40  (cumulative)

Key design decision: We used a hint-match fallback to give GRPO a gradient signal early in training β€” before the model has learned exact column names, partial column matches still receive proportional rewards. This prevented the cold-start problem where the model gets 0 reward for everything and never improves.

Training Config

GRPOConfig(
    max_steps                   = 200,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2,
    learning_rate               = 2e-5,
    max_completion_length       = 150,
    num_generations             = 4,
    temperature                 = 1.0,
    warmup_steps                = 10,
)

Results

Training Curves

After 200 steps of GRPO training:

  • Loss: 4.92e-07 β†’ 1.23e-05 (GRPO policy loss rises as the model becomes more confident in its policy β€” this is expected behaviour in GRPO, not divergence. The 10-step rolling average confirms stable learning without collapse.)
  • Reward: 0.235 β†’ 0.456 (+94% improvement) The reward shows a strong and consistent upward trend from ~0.20 to ~0.45, with the 10-step rolling average clearly confirming the model improved throughout training.

Evaluation β€” Trained vs Random Agent

We evaluated both agents across all 15 scenarios:

Agent Avg Improvement Best Scenario Worst Scenario
Random (wrong index) +0.0 pts 0 pts 0 pts
Trained (GRPO) +31.4 pts +59 pts (Scenario 8 ) +10 pts

The trained agent outperformed the random baseline on every single scenario, with an average improvement of +31.4 database performance points. Scenario 8 was flagged as a statistical outlier (Β±1.5Οƒ above mean) β€” the agent found an especially impactful index combination. The relative gain is ∞ since the untrained baseline scored exactly 0 on all scenarios.


Key Learnings

1. Reward shaping is everything in GRPO

The model started producing low-reward outputs for the first ~10 steps until the hint-match fallback kicked in. Without partial credit for close-but-not-perfect column names, training would have stalled completely.

2. LoRA makes 7B models trainable on a single GPU

With only 0.53% of parameters trainable via LoRA, we fine-tuned a full 7B model on a single GPU in under 2 hours. Without LoRA this would require multiple A100s.

3. Local simulation beats API calls for training

Using DatabaseSimulator directly (instead of calling a REST API) made rewards deterministic, removed shared state bugs, and made training 10x faster with no network latency.

4. GRPO loss behaviour differs from supervised loss

Unlike cross-entropy loss in supervised fine-tuning, GRPO policy loss can increase as the model becomes more confident in its policy. This is normal and does not indicate a problem β€” what matters is whether the reward is trending upward.

5. Composite indexes are hard to learn

The model consistently struggled with scenarios requiring composite indexes on 3+ columns. Single-column indexes were learned quickly (by step ~20), while multi-column patterns took much longer to emerge.


Live Demo

Try the agent yourself β€” pick a scenario difficulty, choose between the trained GRPO agent and the rule-based baseline, and watch it diagnose and fix the database in real time:

SQL Database Engineer Agent β€” Live Demo


Resources


What's Next

  • More steps: 200 steps showed strong learning β€” 500+ steps would likely push the average score above 50 pts
  • Harder scenarios: 8-table schemas with nested subqueries and CTEs
  • Query rewriting: The agent currently focuses on indexing β€” teaching it to rewrite SQL itself is the next frontier
  • Multi-step episodes: Chain multiple actions per episode so the agent can inspect β†’ diagnose β†’ fix β†’ verify in sequence

Acknowledgements

Built for the META Γ— PyTorch Γ— SST Hackathon using:

  • Unsloth β€” 2x faster LoRA fine-tuning
  • TRL β€” GRPO implementation
  • Hugging Face β€” model hosting and Spaces
  • Qwen2.5 β€” base language model