---
title: Training Walkthrough -- GRPO + SQLEnv on Colab
description: Step-by-step companion guide for the train_grpo.ipynb notebook
type: prototype
doc_type: exploration
---

# Training Walkthrough: GRPO + SQLEnv on Colab

Companion guide for `notebooks/train_grpo.ipynb`. Read each section before running the corresponding cell.


## Cell 1-2: Setup

Mental model: You're building a workbench. Colab gives you a bare machine with a GPU. These cells install your tools, clone your project, and download the 10 SQLite databases the agent will practice on.

What happens: The cell detects Colab via `google.colab` in `sys.modules`. If true, it runs `pip install .[training]` (pulls TRL, Transformers, PyTorch, accelerate) and `git clone` to get data files. Then it runs `download_spider_databases.py`, which fetches ~50 MB of Spider SQLite files into `data/databases/`.

What to watch for:

  • The install takes 1-2 minutes. No errors should appear.
  • Output ends with the project root path and "Running on: Colab."
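
The detection-and-install logic described above can be sketched as follows. This is a minimal sketch, assuming the script name and install extra from this guide; the actual notebook cell may differ.

```python
import subprocess
import sys

# Colab injects the google.colab module into every kernel, so its presence
# in sys.modules is a reliable runtime check.
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    # Install the training extras (TRL, Transformers, PyTorch, accelerate).
    subprocess.run([sys.executable, "-m", "pip", "install", ".[training]"], check=True)
    # Fetch ~50 MB of Spider SQLite files into data/databases/.
    subprocess.run([sys.executable, "download_spider_databases.py"], check=True)

print("Running on:", "Colab" if IN_COLAB else "local")
```

On a local machine the install steps are skipped and the final line prints "Running on: local" instead of "Running on: Colab."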

## Cell 3-4: Configuration

Mental model: You're setting the dials on your training rig. Model size, batch size, how many candidate answers to generate per question, how many steps the agent gets per episode.

What happens: Creates a `GRPOConfig` dataclass with:

  • `Qwen/Qwen3-0.6B` as the model (small enough for a free Colab GPU)
  • `num_generations=2`: GRPO generates 2 completions per prompt, ranks them by reward, and updates the policy toward the better one
  • `step_budget=10`: each episode allows 10 actions before forced termination
  • Imports TRL's `GRPOTrainer` class. If this cell fails, the Setup cell didn't install TRL properly.
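
A sketch of what such a config dataclass might look like. Only `model_name`, `num_generations`, and `step_budget` are stated in this guide; the other fields and their defaults are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    model_name: str = "Qwen/Qwen3-0.6B"  # small enough for a free Colab GPU
    num_generations: int = 2             # completions sampled per prompt
    step_budget: int = 10                # actions allowed per episode
    learning_rate: float = 1e-5          # assumed default, not from the guide
    num_train_epochs: int = 1            # matches the runtime note later on

config = GRPOConfig()
```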

## Cell 5-6: Smoke Test

Mental model: You're checking that the database works before you start training. Like turning the ignition before a road trip.

What happens: Creates an `SQLEnvironment` in-process (no server needed). Loads 473 training questions, picks one with `seed=42`, runs a `DESCRIBE` action. If you see "Smoke test passed," the environment, questions, and databases all work.

What to watch for:

  • "Loaded 473 questions": confirms data files loaded
  • A question and schema snippet: confirms SQLite databases are accessible
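
At bottom, a `DESCRIBE`-style action is just reading a table's schema from SQLite. A self-contained illustration with an in-memory database and a made-up table (the real environment wraps the Spider databases instead):

```python
import sqlite3

# In-memory database with a toy table standing in for a Spider schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(singer)")]
print(columns)  # [('singer_id', 'INTEGER'), ('name', 'TEXT'), ('age', 'INTEGER')]
```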

## Cell 7-8: Training

Mental model: This is the road trip. The agent sees questions, explores databases by generating action sequences, and gets scored. GRPO compares multiple attempts at each question and nudges the model toward better strategies. Think of it as the agent playing 473 mini-games, getting a score each time, and gradually learning which moves lead to higher scores.

What happens, step by step:

  1. `load_model_and_tokenizer` downloads Qwen3-0.6B (~1.2 GB) from HuggingFace
  2. `load_question_prompts` converts the 473 questions into prompt strings
  3. `sample_random_baseline` runs 8 episodes with random actions to establish a baseline reward
  4. `build_trainer` wires the model, prompts, and three reward functions into TRL's `GRPOTrainer`
  5. `run_training_with_metrics` runs the training loop, collecting (step, reward) pairs

The three reward functions:

  • `reward_correctness`: did the agent get the right answer? (+1.0 or 0.0)
  • `reward_progress`: are intermediate queries getting closer to the gold answer?
  • `reward_operational`: did actions execute without errors? Did the agent explore new information?

What to watch for:

  • Model download takes 1-2 minutes on first run
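
To make the three reward signals concrete, here is a toy sketch of how they might score one episode. Only the three function names come from this guide; the episode fields, the specific metrics, and the weights are assumptions, not the project's actual API.

```python
def reward_correctness(episode):
    # Binary: +1.0 if the final answer matches the gold answer, else 0.0.
    return 1.0 if episode["final_answer"] == episode["gold_answer"] else 0.0

def reward_progress(episode):
    # Assumed metric: fraction of gold rows already surfaced by intermediate queries.
    seen = episode["rows_seen"] & episode["gold_rows"]
    return len(seen) / max(len(episode["gold_rows"]), 1)

def reward_operational(episode):
    # Assumed shaping: reward error-free actions and exploring new tables.
    ok = episode["actions_ok"] / max(episode["actions_total"], 1)
    explore = min(len(episode["tables_seen"]) / 3, 1.0)
    return 0.5 * ok + 0.5 * explore

episode = {
    "final_answer": [("Joe", 28)], "gold_answer": [("Joe", 28)],
    "rows_seen": {1, 2}, "gold_rows": {1, 2, 3},
    "actions_ok": 4, "actions_total": 5, "tables_seen": {"singer", "concert"},
}
total = reward_correctness(episode) + reward_progress(episode) + reward_operational(episode)
```

The point of the dense design is that `reward_progress` and `reward_operational` still provide gradient signal on episodes where `reward_correctness` is 0.0.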

## Cell 5: Live Visualization Setup

Mental model: You're hiring a coach who watches the training session, takes notes, and shows you charts. The coach tracks two groups of students: one studying the same material (train), one given new problems (eval). If the train group improves but the eval group doesn't, the model is memorizing instead of learning.

What happens: Creates two `SQLEnvironment` instances, one loaded with training questions (473), one with evaluation questions (203). Creates `LiveVisualizationCallback` configured to:

  • Update the reward plot every 10 steps (cheap, reads logged metrics)
  • Print one execution trace every 50 steps (runs one full episode with the current model)
  • Run 3 episodes on each split every 100 steps and plot train-vs-eval success rates
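
The cadence of those three behaviors can be sketched as a simple modulo check per step. The class name matches the guide, but the hook signature and recording logic here are illustrative, not the notebook's actual implementation.

```python
class LiveVisualizationCallback:
    def __init__(self, plot_every=10, trace_every=50, eval_every=100):
        self.plot_every = plot_every
        self.trace_every = trace_every
        self.eval_every = eval_every
        self.fired = []  # record of (kind, step) for this sketch

    def on_step_end(self, step):
        if step % self.plot_every == 0:
            self.fired.append(("plot", step))   # refresh reward curve (cheap)
        if step % self.trace_every == 0:
            self.fired.append(("trace", step))  # run and print one full episode
        if step % self.eval_every == 0:
            self.fired.append(("eval", step))   # 3 episodes per split, plot the gap

cb = LiveVisualizationCallback()
for step in range(1, 101):
    cb.on_step_end(step)
```

Over 100 steps this fires 10 plot refreshes, 2 traces, and 1 train-vs-eval comparison, which keeps the expensive episode rollouts rare relative to the cheap plot updates.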

## Cell 6: Train with GRPO

Mental model: The road trip, now with a dashboard. Instead of driving blind, you see the speedometer (reward curve), hear the engine (execution traces), and watch the fuel gauge (train-vs-eval gap).

What happens: Same as before (GRPO training loop), but the viz callback fires at each step:

  • Every 10 steps: the left plot refreshes with the latest reward value
  • Every 50 steps: a trace prints below the plot showing one episode: the question, each action the model chose, and whether it got the answer right
  • Every 100 steps: the right plot appears showing train success rate vs eval success rate

What to watch for:

  • Reward plot rising = the model learns to get higher rewards
  • Traces showing structure = early traces are random gibberish; later traces should show `DESCRIBE` before `QUERY`, and attempts at real SQL
  • Train and eval curves tracking together = healthy generalization
  • Train rising, eval flat = overfitting; the model memorizes training questions
  • Each step takes 30-60 seconds on a T4 GPU
  • With `num_train_epochs=1`, expect 10-30 minutes total

## Cell 7: Final Summary

Mental model: The report card. A clean static plot of the full training run for screenshots and the blog post.

What happens: Plots the (step, reward) pairs collected during training.

What to watch for:

  • Upward trend = learning signal exists, the dense reward design works
  • Noisy but upward = normal for RL with small models
  • Flat = something is wrong (reward functions not providing signal, or model too small to learn the task)
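
"Upward trend" can also be checked numerically rather than by eyeballing the plot: fit a least-squares slope to the (step, reward) pairs. The history data below is made up for illustration.

```python
def reward_slope(pairs):
    # Ordinary least-squares slope of reward over step.
    n = len(pairs)
    mean_x = sum(s for s, _ in pairs) / n
    mean_y = sum(r for _, r in pairs) / n
    num = sum((s - mean_x) * (r - mean_y) for s, r in pairs)
    den = sum((s - mean_x) ** 2 for s, _ in pairs)
    return num / den

history = [(10, 0.25), (20, 0.30), (30, 0.28), (40, 0.35), (50, 0.40)]
print(reward_slope(history) > 0)  # positive slope = a learning signal exists
```

A clearly positive slope over a noisy curve is the "noisy but upward" case; a slope near zero is the "flat" failure case.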

## Interpreting Results

The goal of this run is not a production agent. It's evidence that:

  1. The environment provides learnable signal. Random baseline scores ~0.25 reward, 0% success. Any improvement over that proves the dense reward design works.
  2. A sub-0.5B model can learn multi-step tool use. Even modest reward improvement demonstrates that small models can acquire exploration strategies through RL.
  3. The pipeline runs end-to-end on free hardware. Colab T4, no paid APIs, no external server.

These three points map directly to blog outline sections 5 and 6.