SQL Debug Env: The Ultimate Master Manual
Comprehensive Wiki & Technical Bible for the Meta PyTorch × OpenEnv Hackathon
Table of Contents
- The "Simple" Concept
- Architecture: How the Machine Works
- The Industry Benchmark: Spider vs. BIRD vs. YOU
- Deep-Dive: The Codebase Map
- The Science: GRPO & Reinforcement Learning
- The "Day in the Life" of a SQL Query
- Current Project Status & Roadmap
- Live Spider Evaluation (The "Ultimate Proof")
- Winning the Q&A (The Cheat Sheet)
1. The "Simple" Concept
Imagine you are a teacher. You have a student (the AI) who is good at English but bad at Math (the SQL). Instead of just giving the student a textbook, you put them in a room with a calculator (the Database). The student tries a problem, uses the calculator, sees the answer is wrong, and tries again. You have built the Room, the Calculator, and the Reward System (the "Stars") that makes the student smarter.
2. Architecture: How the Machine Works
The project is split into two main "Brains":
A. The Environment (The Body / server/)
This is the "physical world" where the SQL lives.
- FastAPI: The "telephone" that lets the AI talk to the database.
- SQLite: The "sandbox" where queries are actually run.
- Graders: The "judge" that compares the result of the AI's query to the "truth."
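To make the "telephone" concrete, here is a minimal sketch of the two endpoints. The payload fields and the tiny schema are illustrative assumptions; the real request/response shapes live in `server/main.py`:

```python
import sqlite3
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StepRequest(BaseModel):
    session_id: str
    sql: str  # the agent's candidate fix

@app.post("/reset")
def reset():
    # New game: hand the agent a broken query plus the error it produces.
    return {"session_id": "demo",
            "broken_sql": "SELECT * FROM userss",
            "error": "no such table: userss"}

@app.post("/step")
def step(req: StepRequest):
    # Make a move: run the candidate in a throwaway in-memory sandbox.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    try:
        rows = conn.execute(req.sql).fetchall()
        return {"ok": True, "rows": rows}
    except sqlite3.Error as exc:
        # Failure is feedback: the error string goes back to the agent.
        return {"ok": False, "error": str(exc)}
```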
B. The Agent (The Brain / grpo_train.py)
This is the intelligence that is trying to learn.
- Model (Qwen2.5-Coder): The actual neural network.
- GRPO Logic: The mathematical formula that tells the model: "Fix #3 was better than Fix #1, change your weights to be more like #3."
3. The Industry Benchmark: Spider vs. BIRD vs. YOU
Judge Question: "Why should we use your environment instead of existing datasets like Spider?"
| Feature | Spider / BIRD (Standard) | SQL Debug Env (YOU) |
|---|---|---|
| Task Type | One-Shot Generation | Iterative Debugging |
| Feedback | None (Static) | Live Database Feedback |
| Difficulty | High-level Text-to-SQL | Low-level Logic/Syntax Fixes |
| Evaluation | Fuzzy (String matching) | Deterministic (Row matching) |
The Reference: Your project is inspired by the DeepSeek R1 and OpenAI o1 reasoning models. You are applying their reinforcement-learning-from-feedback philosophy to the niche world of SQL engineering, except the feedback comes from a live database instead of from human raters as in RLHF (Reinforcement Learning from Human Feedback).
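The last row of the table is the strongest claim, so it is worth spelling out. A minimal sketch of deterministic row matching, using a hypothetical `rows_match` helper:

```python
import sqlite3

def rows_match(db: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> bool:
    # Two queries are graded equal iff they return the same rows,
    # no matter how differently the SQL text is written.
    try:
        got = sorted(db.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return False
    want = sorted(db.execute(gold_sql).fetchall())
    return got == want

# "SELECT name FROM users" vs "SELECT u.name FROM users AS u":
# a string match fails, a row match passes.
```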
4. Deep-Dive: The Codebase Map
| File | What is it? | Why is it here? |
|---|---|---|
| `server/main.py` | The Heart | Acts as the API server. It handles `/reset` (new game) and `/step` (make a move). |
| `server/env.py` | The World | Manages the session state. It knows if the user is in Task 1 or Task 3. |
| `server/database.py` | The Sandbox | Creates temporary SQLite databases in memory so the AI can't break anything. |
| `server/reward.py` | The Scorekeeper | Calculates the "Reward" (0.0 to 1.0). It checks syntax, efficiency, and correctness. |
| `grpo_train.py` | The Trainer | The script that actually "upgrades" the AI's brain using RL. |
| `inference.py` | The Test | A simple script to see how smart the AI is right now, before training. |
| `openenv.yaml` | The ID Card | Tells the hackathon platform how to connect to your project. |
5. The Science: GRPO & Reinforcement Learning
If a judge asks: "How does it learn?"
The Old Way: SFT (Supervised Fine-Tuning)
- You show the AI 1,000 "Correct" answers.
- Problem: The AI just memorizes. It doesn't learn how to "debug" when it sees a new error.
Your Way: GRPO (Group Relative Policy Optimization)
- Step 1: The AI looks at a broken query.
- Step 2: It generates 4 different ways to fix it (a "Group").
- Step 3: We run all 4 in the database and get 4 scores.
- Step 4: We compare them. We tell the AI: "Compared to your other 3 tries, your 2nd try was the best. Do more of that."
- Innovation: This is "Self-Generated Reasoning." The AI is its own teacher.
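In code, the "compare within the group" step is just a normalization. A minimal sketch, assuming the four rewards already came back from the environment:

```python
import torch

def group_advantages(rewards: list[float]) -> torch.Tensor:
    # GRPO's core trick: score each sample against its own group,
    # so no separate value network is needed.
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)

# Four fixes for the same broken query, scored by the database:
print(group_advantages([0.2, 1.0, 0.0, 0.2]))
# The 2nd try gets the big positive advantage ("do more of that");
# the others are pushed down relative to it.
```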
6. The "Day in the Life" of a SQL Query
Follow a query from start to finish:
- The Prompt: "Fix this query: `SELECT * FROM userss` (typo)."
- The Reviewer: Your `reviewer_check` in `main.py` looks at it. If it sees `DROP TABLE`, it rejects it immediately.
- The Sandbox: The query is run in a private SQLite memory space.
- The Comparison: The system runs the "correct" query in the background and compares the rows.
- The Reward: If the rows match, the AI gets `+1.0`. If they don't, but the syntax is valid, it gets `+0.2` (see the sketch after this list).
- The Memory: The AI updates its "Weights" (its digital brain) to remember this success.
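A minimal sketch of those scoring tiers in one place. The function name and blocklist are illustrative; the real logic is split between the reviewer in `main.py` and `server/reward.py`:

```python
import sqlite3

BLOCKED = ("DROP", "DELETE", "ALTER")  # illustrative reviewer blocklist

def score(db: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> float:
    if any(word in candidate_sql.upper() for word in BLOCKED):
        return 0.0                                  # the Reviewer rejects it
    try:
        got = db.execute(candidate_sql).fetchall()  # the Sandbox
    except sqlite3.Error:
        return 0.0                                  # broken syntax
    want = db.execute(gold_sql).fetchall()          # the Comparison
    if sorted(got) == sorted(want):
        return 1.0                                  # rows match
    return 0.2                                      # valid SQL, wrong rows
```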
7. Current Project Status & Roadmap
Project Completion: 95%
✅ Completed:
- Core FastAPI Server & SQLite Sandbox.
- 3 Realistic SQL Debugging Tasks (Easy, Medium, Hard).
- Multi-Agent Reviewer Layer.
- GRPO Training Script verified on Apple Silicon (M2).
- Smoke Test verified (Handshake is 100% working).
⏳ Remaining (For Hackathon Site):
- Scale to Qwen 7B/14B on A100 GPUs.
- Connect Weights & Biases (WandB) for the live presentation curve.
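For that last roadmap item, the WandB hookup is only a few lines. A sketch assuming a hypothetical `train_one_grpo_step` hook into `grpo_train.py` and an assumed project name:

```python
import random
import wandb

def train_one_grpo_step() -> float:
    # Stand-in for one GRPO update; returns the group's mean reward.
    return random.random()

wandb.init(project="sql-debug-env")  # assumed project name
for _ in range(100):
    wandb.log({"mean_reward": train_one_grpo_step()})  # the live curve
wandb.finish()
```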
8. Live Spider Evaluation (The "Ultimate Proof")
How to show the judges your agent can handle real-world academic benchmarks:
- Launch the Spider Task: Run `/reset` with the `spider_cross_eval` task ID (handled by `server/tasks/task_spider.py`).
- The "Blind Test": Ask a judge to pick a random SQL query from the Spider dev set.
- Introduce a Bug: Delete a semicolon, misspell a JOIN, or remove a WHERE clause.
- The Demonstration: Run `inference.py` on that broken Spider query (see the sketch after this list).
- The Result: The agent will use its trained GRPO weights to analyze the error, inspect the Spider schema, and return the fix.
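What the demo call could look like, with illustrative payloads and a made-up sabotaged query (the real request shapes come from `server/main.py`):

```python
import requests

BASE = "http://localhost:8000"

# Load the Spider cross-eval task (see server/tasks/task_spider.py).
state = requests.post(f"{BASE}/reset",
                      json={"task_id": "spider_cross_eval"}).json()

# The judge's sabotage: misspell the JOIN in a Spider-style dev query.
broken = "SELECT T1.name FROM singer AS T1 JION concert AS T2 ON T1.id = T2.singer_id"

# One step; the SQLite error comes back as live feedback.
print(requests.post(f"{BASE}/step",
                    json={"session_id": state["session_id"],
                          "sql": broken}).json())
```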
Why this wins: You are showing that your environment isn't a "closed loop." It can ingest and solve the industry's hardest academic benchmark in real time.
9. Winning the Q&A (The Cheat Sheet)
Q: "Why SQLite?"
"Because it's the world's most used DB. If the agent can reason in SQLite, it can reason in PostgreSQL. I built a 'Simulator' that is DB-agnostic."
Q: "What makes this 'Multi-Agent'?"
"I have two roles: The Fixer (the LLM) and the Reviewer (the guardrail logic). They interact to ensure every query is safe and syntactically sound before execution."
This manual is your secret weapon. Read it, understand it, and you will own the stage.