laterabhi committed
Commit 0c6bf6e · verified · 1 Parent(s): 979f139

Update Blog.md

Files changed (1):
  1. Blog.md +86 -61
Blog.md CHANGED
@@ -1,84 +1,109 @@
- # GRPO Training for SQL Query Optimization

- ## Overview
- Fine-tuned `Qwen/Qwen2.5-0.5B-Instruct` using GRPO (Group Relative Policy Optimization)
- reinforcement learning to optimize SQL queries using a DuckDB execution environment.

- ## Problem Statement
- SQL query optimization is critical for database performance. This project trains an LLM
- to automatically identify and fix SQL anti-patterns using RL with verifiable rewards.

- ## Approach

- ### Environment
- - Used [SQL Query Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- - DuckDB-based execution environment with 5 tasks of increasing difficulty
- - Tasks: basic antipatterns, correlated subqueries, wildcard scans, implicit joins, window functions

- ### GRPO Training
- - **Algorithm:** GRPO (Group Relative Policy Optimization)
- - **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- - **Episodes:** 100
- - **Group Size:** 4 completions per prompt
- - **Hardware:** Kaggle GPU T4 x2

- ### Reward Function
- The reward function combines multiple signals:
- - `execution_speedup`: How much faster the optimized query runs
- - `result_correctness`: Whether the optimized query returns identical results
- - `issue_detection`: Whether SQL anti-patterns were correctly identified
- - `approval_correctness`: Whether the approval flag is set correctly
- - `summary_quality`: Quality of the explanation
- - `severity_labels`: Correctness of severity ratings

- Bonus reward added for correct issue detection even when SQL execution fails,
- providing a useful gradient signal for partial progress.

- ## Results

- ### Training Progress
  | Metric | Value |
  |--------|-------|
- | Start avg (ep1-10) | 0.3090 |
- | End avg (ep91-100) | 0.5962 |
- | Improvement | +93% |

- ### Final Evaluation
  | Task | Difficulty | Score |
  |------|-----------|-------|
- | task_1_basic_antipatterns | easy | 0.7500 ✅ |
- | task_2_correlated_subqueries | medium | 0.8313 ✅ |
- | task_3_wildcard_scan | medium-hard | 0.9250 ✅ |
- | task_4_implicit_join | hard | 0.6438 ✅ |
- | task_5_window_functions | expert | 0.6250 ⚠️ |
- | **Average** | | **0.7550** |

- **Baseline (original query unchanged): 0.6300**
- **Improvement over baseline: +0.1250 (+12.5%)**

- ### Training Curve
- ![Training Curve](grpo_results.png)

- ## Key Findings

- 1. **Reward variance is critical** — Early runs had flat 0.08 rewards because the model
- generated invalid SQL. Fixing the prompt to include schema information created reward
- variance needed for GRPO to learn.

- 2. **Prompt engineering matters for RL** — Explicitly telling the model to use only
- columns from the schema was the single most impactful fix.

- 3. **Partial credit helps** — Adding issue detection bonus gave the model a learning
- signal even when SQL execution failed.

- 4. **Task difficulty affects learning** — Harder tasks (implicit joins, window functions)
- consistently scored lower, suggesting curriculum learning could help.

- ## Model
- https://huggingface.co/laterabhi/grpo-sql-optimizer

- ## References
- - [GRPO Paper - DeepSeekMath](https://arxiv.org/abs/2402.03300)
- - [TRL Library](https://huggingface.co/docs/trl)
- - [SQL Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- - [Qwen2.5 Model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
 
+ # GRPO Training for SQL Query Optimization (DuckDB-Verifiable Rewards)
+
+ Fine-tuned `Qwen/Qwen2.5-0.5B-Instruct` using **GRPO (Group Relative Policy Optimization)** to optimize SQL queries with **verifiable rewards**: we **execute** both the original and the rewritten SQL against a real DuckDB database and score them on measured speedup and correctness.
+
+ - **Repo (source of truth):** https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-
+ - **Model:** https://huggingface.co/laterabhi/grpo-sql-optimizer
+ - **Space:** https://huggingface.co/spaces/laterabhi/grpo-sql-optimizer
+
+ ---
+
+ ## Why this matters
+
+ LLMs often generate SQL that is syntactically valid but **slow** (or subtly wrong) at scale. Classic training setups use heuristic scoring, which can be gamed. This project trains and evaluates SQL optimization with **execution-grounded feedback**.
+
+ ---
+
+ ## Environment (5 tasks, increasing difficulty)
+
+ We use the **SQL Query Optimization Environment** (OpenEnv compliant), backed by an in-memory DuckDB dataset:
+
+ - `users` (10k rows), `orders` (500k), `events` (1M), `products` (1k)
+
+ Tasks:
+ 1. `task_1_basic_antipatterns` (easy)
+ 2. `task_2_correlated_subqueries` (medium)
+ 3. `task_3_wildcard_scan` (medium-hard)
+ 4. `task_4_implicit_join` (hard)
+ 5. `task_5_window_functions` (expert)
+
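+ For intuition, here is a minimal sketch of what a DuckDB-backed dataset of this shape could look like. The table names and row counts come from the environment description above; the column definitions are illustrative assumptions, not the repo's actual setup code.
+
+ ```python
+ import duckdb
+
+ # In-memory database, matching the environment description above.
+ con = duckdb.connect(":memory:")
+
+ # Illustrative schemas only -- the real environment defines its own columns.
+ con.execute("CREATE TABLE users    AS SELECT range AS user_id, 'user_' || range::VARCHAR AS name FROM range(10000)")
+ con.execute("CREATE TABLE products AS SELECT range AS product_id, random() * 100 AS price FROM range(1000)")
+ con.execute("CREATE TABLE orders   AS SELECT range AS order_id, (random() * 9999)::INT AS user_id, (random() * 999)::INT AS product_id FROM range(500000)")
+ con.execute("CREATE TABLE events   AS SELECT range AS event_id, (random() * 9999)::INT AS user_id FROM range(1000000)")
+
+ print(con.execute("SELECT count(*) FROM orders").fetchone())  # (500000,)
+ ```
+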
+ ---
+
+ ## Reward function (execution-grounded)
+
+ Composite reward in \[0, 1\], combining:
+
+ - **execution_speedup (35%)**: measured ratio `original_ms / optimized_ms` from DuckDB
+ - **result_correctness (20%)**: results-match check (order-independent for large outputs)
+ - **issue_detection (25%)**: anti-pattern detection vs. ground-truth keywords per task
+ - **approval_correctness (8%)**: whether the approval flag is set correctly
+ - **summary_quality (7%)**: quality of the explanation
+ - **severity_labels (5%)**: correctness of severity ratings
+
+ This is designed to be **hard to game**: “fast but wrong” loses correctness; “verbose but slow” loses speedup.
+
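+ As a rough sketch of how such a composite can be computed (the weights are the ones listed above; the speedup normalization, helper names, and `judged` field are assumptions for illustration, not the repo's exact implementation):
+
+ ```python
+ import time
+ import duckdb
+
+ # Weights from the breakdown above.
+ WEIGHTS = {
+     "execution_speedup": 0.35, "result_correctness": 0.20, "issue_detection": 0.25,
+     "approval_correctness": 0.08, "summary_quality": 0.07, "severity_labels": 0.05,
+ }
+
+ def time_query(con, sql):
+     """Execute a query and return (elapsed milliseconds, rows)."""
+     start = time.perf_counter()
+     rows = con.execute(sql).fetchall()
+     return (time.perf_counter() - start) * 1000.0, rows
+
+ def composite_reward(con, original_sql, optimized_sql, judged):
+     """Weighted sum in [0, 1]; speedup and correctness are measured, not self-reported.
+     `judged` holds the text-derived component scores (issue detection, summary, ...)."""
+     orig_ms, orig_rows = time_query(con, original_sql)
+     opt_ms, opt_rows = time_query(con, optimized_sql)
+     # Measured ratio original_ms / optimized_ms, capped and squashed into [0, 1]
+     # (the 10x cap is an assumption, not the repo's exact normalization).
+     speedup = min(orig_ms / max(opt_ms, 1e-6), 10.0) / 10.0
+     # Order-independent result comparison, as described above.
+     correct = float(sorted(map(repr, orig_rows)) == sorted(map(repr, opt_rows)))
+     scores = {**judged, "execution_speedup": speedup, "result_correctness": correct}
+     return sum(w * scores.get(k, 0.0) for k, w in WEIGHTS.items())
+ ```
+
+ Because speedup and correctness come from actual execution, a policy cannot raise its reward by rewriting the summary alone; it has to produce SQL that runs faster and returns the same rows.
+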
+ ---
+
+ ## Training setup (GRPO)
+
+ - **Algorithm:** GRPO (Group Relative Policy Optimization)
+ - **Base model:** `Qwen/Qwen2.5-0.5B-Instruct`
+ - **Group size:** 4 completions per prompt
+ - **Hardware:** Kaggle GPU (T4 x2); notebook linked in the repo README
+ - **Note:** `train.py` defaults to 200 episodes, but the curve and table reported below come from the 100-episode run described in the repo.
+
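+ The repo's `train.py` is the source of truth for the actual loop. For orientation only, a group-size-4 GRPO run in TRL looks roughly like the sketch below; the one-prompt dataset and the zero-reward stub are placeholders, and the real run wires in the environment's tasks and the execution-grounded composite reward described above.
+
+ ```python
+ from datasets import Dataset
+ from trl import GRPOConfig, GRPOTrainer
+
+ # Placeholder prompt -- the real run builds prompts from the five tasks,
+ # including the schema so the model uses only real columns.
+ train_dataset = Dataset.from_dict(
+     {"prompt": ["Optimize this SQL query using only columns from the given schema: ..."]}
+ )
+
+ def sql_reward(completions, **kwargs):
+     # Stub: the real reward executes each candidate query in DuckDB and
+     # returns the composite score in [0, 1].
+     return [0.0 for _ in completions]
+
+ args = GRPOConfig(
+     output_dir="grpo-sql-optimizer",
+     num_generations=4,  # group size: 4 completions per prompt
+ )
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen2.5-0.5B-Instruct",
+     reward_funcs=sql_reward,
+     args=args,
+     train_dataset=train_dataset,
+ )
+ trainer.train()
+ ```
+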
+ ---
+
+ ## Results (from the GitHub repo)
+
+ ### Training progress (100 episodes)
+
  | Metric | Value |
  |--------|-------|
+ | Start avg (ep 1–10) | 0.3090 |
+ | End avg (ep 91–100) | 0.5962 |
+ | Improvement | **+93%** |
+
+ **Reward curve:**
+
+ ![GRPO reward curve](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/grpo_reward_curve.png)
+
+ ### Final evaluation (per task)
+
  | Task | Difficulty | Score |
  |------|-----------|-------|
+ | task_1_basic_antipatterns | easy | **0.7500** ✅ |
+ | task_2_correlated_subqueries | medium | **0.8313** ✅ |
+ | task_3_wildcard_scan | medium-hard | **0.6563** ✅ |
+ | task_4_implicit_join | hard | **0.6563** ✅ |
+ | task_5_window_functions | expert | **0.6500** |
+
+ > **Task 5 note:** `task_5_window_functions` is the **expert** scenario, so it is expected to score lowest. This is not an error, just the hardest distribution.
+
+ ### “Before / After” (environment-only, no API keys)
+
+ We also provide a **reproducible** before/after contrast:
+ - **Before:** suggestions are present but `optimized_query` is empty (no speedup/correctness signal)
+ - **After:** a deterministic fallback policy emits a real optimized query
+
+ The table and chart are committed in the repo:
+
+ - `results/before_after_table.md`
+ - `results/before_after_chart.png`
+
+ ![Before/After chart](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/before_after_chart.png)
+
+ ---
+
+ ## How to reproduce (locally)
+
+ ```bash
+ git clone https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-.git
+ cd SQL-Query-Optimization-Environment-
+ pip install -r requirements.txt
+
+ # Baselines (fallback + optional LLM if HF_TOKEN is set)
+ python baseline_runner.py
+
+ # Environment-only before/after (no API keys)
+ python training/eval_before_after.py --save-dir results
+ ```