Slonik-7B-GRPO

A PostgreSQL and SQLite text-to-SQL model, fine-tuned from Qwen2.5-Coder-7B-Instruct via QLoRA SFT followed by 2000-step GRPO with execution-based rewards.

Why I built this

I wanted a small text-to-SQL model that handled real PostgreSQL and SQLite questions — JSONB access, pgvector similarity, full-text search, window functions, deep CTEs — and was small enough to run locally. SFT alone got partway there (33.20% on BIRD-PG, already past most open 7B and 32B baselines), but the model still produced syntactically clean queries that referenced columns the schema didn't have. That's the pattern execution-based RL is built to fix.

This is the GRPO version. It scores 38.20% on BIRD-PG, 5.0 points above the SFT checkpoint, and shows a consistent improvement on dialect-specific issues such as PostgreSQL date functions and identifier quoting.

Results

BIRD Mini-Dev (500-example official benchmarks, execution accuracy):

| Model | BIRD-PG | BIRD-SQLite | Size |
|---|---|---|---|
| o3-mini | 47.78% | | reasoning |
| Claude 3.7 Sonnet | 39.26% | | proprietary |
| Slonik-7B-GRPO (this) | 38.20% | 45.20% | 7B |
| GPT-4o | 34.44% | | proprietary |
| Slonik-7B-SFT (sibling) | 33.20% | | 7B |
| Qwen2.5-Coder-32B | 22.96% | | 32B |
| Codestral 22B | 21.11% | | 22B |
| Qwen2.5-Coder-7B (base) | 12.22% | | 7B |

Performance by difficulty

| Tier | BIRD-PG | BIRD-SQLite |
|---|---|---|
| Simple | 56.1% | 66.2% |
| Moderate | 33.6% | 38.0% |
| Challenging | 23.5% | 32.4% |

SFT → GRPO trajectory on BIRD-PG

| Stage | Overall | Simple | Moderate | Challenging |
|---|---|---|---|---|
| Base Qwen2.5-Coder-7B | 12.22% | | | |
| Slonik-7B-SFT | 33.20% | 48.6% | 29.6% | 19.6% |
| Slonik-7B-GRPO (500 steps) | 34.60% | 49.3% | 31.2% | 21.6% |
| Slonik-7B-GRPO (2000 steps) | 38.20% | 56.1% | 33.6% | 23.5% |

The largest absolute gains over SFT were on the simple tier (+7.5 pts), followed by moderate (+4.0 pts) and challenging (+3.9 pts). The hardest tier moved least and remains the lowest in absolute terms, which lines up with what 7B models can do given short context budgets.

Training

Two stages on a single RTX 5080 Laptop GPU (16 GB VRAM, Blackwell sm_120). Total external cost about $3 (DeepSeek API for synthetic data generation).

Stage 1 — QLoRA SFT (8h 13min)

QLoRA fine-tune of Qwen2.5-Coder-7B-Instruct on 21,847 text-to-SQL pairs:

  • BIRD-SQL train split — 6,601 examples
  • Spider train split — 8,034 examples
  • Gretel synthetic text-to-SQL PostgreSQL subset — 5,212 examples
  • PG-Modern custom synthesis — 2,000 examples covering pgvector, JSONB, full-text search, CTEs, window functions, and array operations

LoRA rank 32, alpha 64, 4-bit NF4 base, LR 1e-5, max_grad_norm 0.5, adamw_torch_fused. Final eval_loss 0.290.

Stage 2 — GRPO with execution rewards (16h)

GRPO with three weighted reward signals: execution match against BIRD SQLite databases (weight 1.0), syntax validity via sqlglot (weight 0.2), and code-fence formatting (weight 0.1). 2000 steps total, num_generations=2, LR 5e-6.
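A minimal sketch of how three weighted signals like these could be combined into one scalar reward. This is an illustration, not the training code: `extract_sql`, `run_query`, and `combined_reward` are hypothetical names, result sets are compared order-insensitively (an assumption), and the syntax check is passed in as a callable so the sketch runs without sqlglot installed.

```python
import re
import sqlite3

FENCE = "`" * 3  # triple backtick, built indirectly so it does not clash with this block
FENCE_RE = re.compile(FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

def extract_sql(completion):
    """Return (sql, fenced): the query text and whether it sat inside a code fence."""
    match = FENCE_RE.search(completion)
    if match:
        return match.group(1).strip(), True
    return completion.strip(), False

def run_query(conn, sql):
    """Execute sql and return rows in a canonical order, or None on an SQLite error."""
    try:
        return sorted(repr(row) for row in conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def combined_reward(completion, gold_sql, conn, syntax_ok):
    """Weighted sum: execution match (1.0) + syntax validity (0.2) + fence format (0.1)."""
    sql, fenced = extract_sql(completion)
    predicted = run_query(conn, sql)
    gold = run_query(conn, gold_sql)
    exec_match = predicted is not None and predicted == gold
    return 1.0 * exec_match + 0.2 * bool(syntax_ok(sql)) + 0.1 * fenced

# Toy database and rollouts (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 45)])

gold = "SELECT name FROM singer WHERE age > 40"
good = FENCE + "sql\nSELECT name FROM singer WHERE age > 40\n" + FENCE
bad = "SELECT nme FROM singer WHERE age > 40"  # hallucinated column, no fence

always_ok = lambda sql: True  # stand-in for a real parser-based syntax check
score_good = combined_reward(good, gold, conn, always_ok)
score_bad = combined_reward(bad, gold, conn, always_ok)
```

In this toy setup the fenced, correct rollout collects all three components, while the unfenced query with a misspelled column only keeps whatever the syntax check awards.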

The 16-hour wall time is from disabling vLLM rollouts (the available vLLM wheels are built for CUDA 13 and don't load on my CUDA 12.8 Blackwell driver). With vLLM, the same 2000 steps would have taken closer to 2–3 hours.

What GRPO actually fixed

Looking at the 500 BIRD-PG examples, GRPO fixed 12 queries that SFT got wrong and broke 6 that SFT had right — net +6, plus the broader trend of better dialect awareness.

The biggest improvement was dialect awareness. SFT kept generating MONTH(date) — that's MySQL syntax and just fails on Postgres. GRPO learned EXTRACT(MONTH FROM date) from the executions that came back as errors.

It also got better at date formats. SFT was guessing patterns like LIKE '%/%/87%' (assuming mm/dd/yy), which returned empty result sets against dates stored as YYYY-MM-DD. GRPO settled on LIKE '%1987%' after enough wrong-answer signals.
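The date-format failure is easy to reproduce with Python's built-in `sqlite3` module. The table and column names below (`patient`, `birthday`) are made up for the illustration; only the two `LIKE` patterns come from the observations above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (birthday TEXT)")
# Dates stored as YYYY-MM-DD strings.
conn.executemany(
    "INSERT INTO patient VALUES (?)",
    [("1987-03-14",), ("1987-11-02",), ("1990-01-05",)],
)

# Pattern that assumes mm/dd/yy: there are no slashes in the data, so nothing matches.
slash_hits = conn.execute(
    "SELECT COUNT(*) FROM patient WHERE birthday LIKE '%/%/87%'"
).fetchone()[0]

# Pattern that matches the four-digit year wherever it appears in the string.
year_hits = conn.execute(
    "SELECT COUNT(*) FROM patient WHERE birthday LIKE '%1987%'"
).fetchone()[0]

print(slash_hits, year_hits)
```

The first query runs without error and simply returns no rows, which is why a signal based on matching results, rather than on syntax or execution success alone, is needed to catch it.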

A smaller but interesting one: it learned when not to quote identifiers. SFT was over-quoting in cases where the DDL was unquoted, which broke case-sensitive matches.

Limitations

This is not a general SQL assistant for every dialect — it's tuned around PostgreSQL and SQLite specifically. Behavior on MySQL or SQL Server isn't validated.

The 7B size still shows up on harder examples. Challenging-tier BIRD-PG accuracy is 23.5%, and schema grounding is imperfect on tables with 30+ columns, where most remaining errors are hallucinated column names. My guess is that's a 7B context-handling limitation more than a training-data issue.
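One way to catch hallucinated column names before running a query on real data is to compile it against an empty copy of the schema. The sketch below does this for SQLite with `EXPLAIN`; it is not part of the model or its training pipeline, `grounding_error` is a hypothetical helper, and PostgreSQL would need a different mechanism.

```python
import sqlite3

def grounding_error(schema_ddl, sql):
    """Compile sql against an empty in-memory copy of the schema.

    Returns None if SQLite accepts the query, otherwise the error message
    (for example "no such column: ...").
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + sql)  # compiles the statement, does not run it
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

# Hypothetical schema and queries for illustration.
ddl = "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);"
ok = grounding_error(ddl, "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id")
bad = grounding_error(ddl, "SELECT customer_name FROM orders")
```

A check like this could feed a retry loop or a post-filter at inference time, at the cost of one extra compile per candidate query.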

GRPO has its own failure mode I observed in the eval comparison: it occasionally over-quotes identifiers or adds unnecessary DISTINCT clauses. The 6 regressions across 500 BIRD-PG examples (against the SFT baseline) come from this pattern. The net gain was still positive, but it's one weakness of binary execution rewards — the model can't always distinguish between "succeeded because of better grounding" and "succeeded because of incidental stylistic choices in the rollout."
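The credit-assignment problem follows from how group-relative advantages are computed: each rollout gets a single scalar advantage, shared by every token in it. A minimal sketch, assuming the common mean and standard-deviation normalization within a group; the exact variant (population vs. sample standard deviation, the epsilon value) differs between implementations and is an assumption here, as is the helper name `group_advantages`.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two rollouts per prompt, as with num_generations=2.
mixed = group_advantages([1.3, 0.3])  # one rollout matched, one did not
tied = group_advantages([1.3, 1.3])   # both rollouts matched
```

With two rollouts, a mixed outcome pushes the whole successful query up and the whole failed query down, including any incidental `DISTINCT` or quoting it happened to contain. A tie produces zero advantage for both, so that prompt contributes no learning signal.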

Notes for Blackwell laptops

On RTX 5080 / sm_120, vLLM CUDA 13 wheels didn't load on the CUDA 12.x runtime, so both stages trained through Unsloth's Triton fallback (no flash-attn, no nvcc). AdamW 8-bit produced NaNs within the first 100 SFT steps every time; adamw_torch_fused with LR 1e-5 and grad clipping at 0.5 stabilized SFT. For GRPO, the key stability fix was catching every exception type from sqlglot in the reward function — a TokenError from an unterminated string literal in one rollout crashed the run at step 320 the first time around.
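The reward-function fix can be sketched as a wrapper that treats any parser exception as "invalid" rather than letting it propagate into the trainer. To keep the example dependency-free, the parser is injected: `fake_parse` and `FakeTokenError` are stand-ins invented for this sketch, and in a setup like the one described here the callable would presumably be `sqlglot.parse_one`.

```python
class FakeTokenError(Exception):
    """Stand-in for a tokenizer error raised by a SQL parser."""

def fake_parse(sql):
    """Toy parser: rejects queries with an unterminated single-quoted string."""
    if sql.count("'") % 2:
        raise FakeTokenError("unterminated string literal")
    return sql

def syntax_reward(sql, parse, weight=0.2):
    """Return weight if parse(sql) succeeds, 0.0 on any exception."""
    try:
        parse(sql)
    except Exception:
        # Deliberately broad: a single unexpected exception type in one
        # rollout should cost that rollout its reward, not end the run.
        return 0.0
    return weight

valid_score = syntax_reward("SELECT name FROM t WHERE city = 'Oslo'", fake_parse)
broken_score = syntax_reward("SELECT name FROM t WHERE city = 'Oslo", fake_parse)
```

Catching `Exception` is usually discouraged, but in a reward function that runs on thousands of model-generated strings it trades diagnostic precision for not losing hours of training to a malformed rollout.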

Author

Phani