---
title: GRPO Training Session Log
description: >-
  Chronological log of GRPO training runs on Qwen3-0.6B/1.7B covering nine runs,
  fixes applied, multi-turn SFT breakthrough, and capacity ceiling analysis
doc_type: exploration
---
GRPO Training Session Log
Context
Training Qwen3-1.7B as a SQL agent using SFT warmup + GRPO with TRL's environment_factory on Spider dataset. Running on Colab L4 (24GB).
Started 2026-04-02. Multi-turn SFT breakthrough on 2026-04-03.
Key Findings & Fixes Applied
1. SFT Null-Param Injection (ROOT CAUSE of first collapse)
Problem: Qwen3's apply_chat_template expands dict arguments to include ALL parameter names from ALL tools with null values. SFT trained model to always generate {"sql": null, "table_name": "X", "value": null}.
Fix: Pass arguments as JSON strings (json.dumps({"table_name": table})) instead of dicts. Tokenizer uses strings verbatim.
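A minimal sketch of the difference (the message layout follows the standard HF tool-call schema; the null-expansion behavior is as described above):

```python
import json

table = "singer"

# Dict arguments: Qwen3's chat template may expand them to the union of all
# tool parameters, injecting nulls into the rendered SFT target.
dict_args_call = {"name": "describe", "arguments": {"table_name": table}}

# JSON-string arguments: the template emits the string verbatim, so the
# target contains exactly the keys we wrote.
str_args_call = {"name": "describe", "arguments": json.dumps({"table_name": table})}

print(str_args_call["arguments"])  # → {"table_name": "singer"}
```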
2. SFT Answer Formatting
Problem: Gold answers were Python literals (['a', 'b'], [[1, 'amc']]). Model learned wrong format.
Fix: _format_answer_for_model() converts to human-readable: comma-separated lists, pipe-separated table rows.
3. Empty Tool Responses
Problem: TRL adapter returned observation.result (empty on SQL errors), hiding errors from model.
Fix: _result_or_error() falls back to observation.error so model sees "Error: SQL error: ...".
4. Post-Episode Penalty
Problem: Model continues calling tools after answering, wasting steps with no signal.
Fix: _POST_EPISODE_PENALTY = -0.1 applied in all 4 tool methods when self._done is True.
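A sketch of the guard each tool method can share (class and method names are illustrative, not the adapter's actual API):

```python
_POST_EPISODE_PENALTY = -0.1

class ToolAdapter:  # hypothetical stand-in for the TRL adapter
    def __init__(self):
        self._done = False
        self._reward = 0.0

    def _post_episode_guard(self):
        # Called at the top of every tool method: penalize calls after done.
        if self._done:
            self._reward += _POST_EPISODE_PENALTY
            return {"error": "Episode is over"}
        return None

adapter = ToolAdapter()
adapter._done = True
print(adapter._post_episode_guard())  # → {'error': 'Episode is over'}
print(adapter._reward)                # → -0.1
```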
5. Answer Stripping
Problem: Model wraps answers in quotes, code fences, "Answer:" prefix.
Fix: _strip_answer_wrapping() in verifier preprocesses predicted answers.
6. Per-Turn SFT → Multi-Turn SFT (ROOT CAUSE of Run 5 stall)
Problem: SFT generated one example per assistant turn (347 examples, ~50% describe calls). Model over-learned "call describe" and never practiced query→answer. During GRPO with KL penalty, model stayed anchored to this single-turn policy.
Fix: Generate one full multi-turn example per question (100 examples, each containing describe→query→answer). Enable assistant_only_loss via Qwen3 template patch so loss is on assistant turns only.
Key detail: Qwen3's chat template lacks {% generation %} tags required by TRL for assistant_only_loss. Patch the template before SFT, restore original before GRPO (TRL does exact-match template checks in add_response_schema() and get_training_chat_template()).
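The patch/restore dance, sketched with a stand-in tokenizer (the toy templates below are assumptions; the real patch wraps assistant message bodies in `{% generation %}...{% endgeneration %}`):

```python
from types import SimpleNamespace

# Stand-in for the HF tokenizer; both templates here are toy strings.
tokenizer = SimpleNamespace(chat_template="{{ messages }}")
PATCHED_TEMPLATE = "{% generation %}{{ messages }}{% endgeneration %}"

original = tokenizer.chat_template          # 1. save the stock template
tokenizer.chat_template = PATCHED_TEMPLATE  # 2. patch, then run SFT with
                                            #    assistant_only_loss=True
# ... SFT happens here ...
tokenizer.chat_template = original          # 3. restore before GRPO: TRL's
# add_response_schema() / get_training_chat_template() compare the template
# string exactly, so the stock template must be back in place.

assert tokenizer.chat_template == "{{ messages }}"
```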
7. Removed Arrow-Notation Few-Shot Examples
Problem: System prompt contained few-shot examples using arrow notation (→ describe(table_name="X")) while the model must produce <tool_call>{"name":"describe","arguments":...}</tool_call> JSON. Two competing formats for a 1.7B model.
Fix: Removed _FEW_SHOT_BLOCK from system prompt. The textual "Strategy" section is sufficient.
8. KL Penalty + Curriculum
Problem: GRPO drifted policy away from SFT, causing <tool_response> instead of <tool_call>.
Fix: beta=0.04 KL penalty + easy-first curriculum (phase 1: easy only, phase 2: easy+medium). With multi-turn SFT, beta=0.04 no longer blocks exploration.
9. OOM with Reference Model
Problem: beta>0 loads reference model copy, doubling memory on L4.
Fix: Reduced num_generations 6→4 and max_new_tokens 1024→512 for phase 1. Phase 2 sets beta=0 (dropping the reference model) and uses 1024 tokens.
10. generation_batch_size Divisibility
Problem: generation_batch_size (default 8) not divisible by num_generations (6).
Fix: Set generation_batch_size=config.num_generations in notebook_pipeline.
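The constraint is plain divisibility, sketched below (variable names match the config fields; the claim that TRL errors on a mismatch is as described above):

```python
num_generations = 6
generation_batch_size = 8  # TRL default

# GRPO samples completions in groups of num_generations per prompt, so the
# generation batch must split evenly into whole groups.
assert generation_batch_size % num_generations != 0  # 8 % 6 == 2 → mismatch

generation_batch_size = num_generations  # the notebook_pipeline fix
assert generation_batch_size % num_generations == 0
```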
Discovered Issues (not yet fixed)
CTE (WITH clause) rejected by environment
Problem: sql_environment.py SQL validation only allows queries starting with SELECT. The model discovers CTEs during GRPO (WITH dogs AS (...) SELECT ...), gets "Error: Only SELECT queries are allowed. Got: WITH", wastes a step recovering.
Impact: Burns 1-2 steps on error recovery, reducing reward. Teaches model to avoid CTEs even though they're valid read-only SQL.
Root cause: Hard-coded prefix check. The DB is already opened with mode=ro, so SQLite itself would reject writes.
Fix: Allow WITH as a valid query prefix, or remove the prefix check entirely and rely on mode=ro.
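A sketch of the relaxed check (function name hypothetical; the keyword logic follows the fix described above):

```python
def validate_query(sql: str) -> None:
    # Allow CTEs: WITH is read-only SQL. Writes are already blocked by the
    # connection itself, e.g. sqlite3.connect("file:spider.db?mode=ro", uri=True).
    first_keyword = sql.lstrip().split(None, 1)[0].upper()
    if first_keyword not in ("SELECT", "WITH"):
        raise ValueError(f"Only SELECT queries are allowed. Got: {first_keyword}")

validate_query("WITH dogs AS (SELECT 1 AS n) SELECT n FROM dogs")  # now accepted
```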
Post-episode repetition
Problem: Model keeps calling tools after episode ends (gets {'error': 'Episode is over'}). The -0.1 penalty exists but model still does 3-5 extra calls.
Possible fixes: Increase penalty, or the model may learn to stop as GRPO training progresses.
HF_SUFFIX naming bug (FIXED)
Problem: HF_SUFFIX is concatenated directly onto grpo without auto-prepending a dash. Setting HF_SUFFIX="no-no-thinking" produces sqlenv-qwen3-1.7b-grpono-no-thinking instead of the intended sqlenv-qwen3-1.7b-grpo-no-no-thinking. The grpono-no-thinking checkpoint on HF Hub was manually renamed via HF UI after push.
Root cause: Format string f"sqlenv-{_model_short}-grpo{HF_SUFFIX}" expects the user to include a leading dash.
Fix: Auto-prepend dash + strip existing prefixes from checkpoint names. When resuming from hjerpe/sqlenv-qwen3-0.6b-grpo, the old code produced sqlenv-sqlenv-qwen3-0.6b-grpo-grpo-v2 (double prefix). Now strips sqlenv- and -grpo* from _model_short before rebuilding the name.
Files: notebooks/train_grpo.ipynb save cell.
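A sketch of the normalization (function name hypothetical; the stripping rules follow the fix described above):

```python
import re

def build_repo_name(model_short: str, suffix: str = "") -> str:
    # Strip earlier decoration so resumed checkpoints don't double up
    # (sqlenv-sqlenv-...-grpo-grpo-v2).
    base = re.sub(r"^sqlenv-", "", model_short)
    base = re.sub(r"-grpo.*$", "", base)
    if suffix and not suffix.startswith("-"):
        suffix = "-" + suffix  # auto-prepend the missing dash
    return f"sqlenv-{base}-grpo{suffix}"

print(build_repo_name("qwen3-1.7b", "no-no-thinking"))
# → sqlenv-qwen3-1.7b-grpo-no-no-thinking
print(build_repo_name("sqlenv-qwen3-0.6b-grpo", "v2"))
# → sqlenv-qwen3-0.6b-grpo-v2
```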
Save cell uses Phase 1 config for output_dir
Problem: model.save_pretrained(config.output_dir) uses Phase 1's config, not Phase 2's config2. Both phases write to outputs/grpo_run — Phase 2 overwrites Phase 1 checkpoints in the same directory.
Impact: Not a correctness bug (the final model weights are from Phase 2, which is correct), but fragile if you want to preserve Phase 1 checkpoint separately.
Fix: Use config2.output_dir in the save cell, or save Phase 1 to a separate directory before Phase 2 starts.
Training Runs
Run 1 (pre-fixes): SFT OK, GRPO plateau at ~30-40% accuracy
- Model learned tool-calling but rewards flat, advantage=0 most steps
- Identified: no penalty for post-episode, answer format issues
Run 2 (batch 1 fixes): GRPO collapse — null args
- SFT taught `{"sql": null, "table_name": "X", "value": null}`
- Every rollout got TypeError → reward=0 → no gradient signal
- Root cause: Qwen3 tokenizer expanding dict args
Run 3 (JSON string args fix): GRPO collapse — format drift
- SFT clean, first ~30 steps showed correct tool calls
- By step 40+: model output `<tool_response>` instead of `<tool_call>`
- GRPO drifted structural tokens without KL penalty
Run 4 (KL penalty beta=0.04): OOM
- Reference model doubled memory, exceeded L4 24GB
Run 5 (beta=0.04, reduced tokens/generations): KL too conservative
- No collapse, correct format, but reward=0.00 everywhere
- Model only generates single describe call per rollout
- KL penalty keeps model too close to single-turn SFT policy
- All 4 rollouts identical → advantage=0 → no learning
Run 6 (multi-turn SFT + assistant_only_loss): First successful training
- Switched SFT from per-turn (347 examples) to multi-turn (100 full trajectories)
- Enabled `assistant_only_loss` via Qwen3 template patch
- Removed arrow-notation few-shot examples from system prompt
- Phase 1 (435 easy, beta=0.04, 512 tokens, ~2h50m):
- Clear upward reward trend: ~0.15 → 0.5-0.75
- Loss trends upward 0→0.14, showing learning from reward signal
- Model writes JOINs, GROUP BY HAVING, NOT IN subqueries, uses `sample` tool
- Recovers from SQL errors (wrong column → retry, CTE rejected → plain JOIN)
- CTE (WITH) queries rejected by environment — wasted steps
- Phase 2 (467 easy+medium, beta=0, 1024 tokens, ~3h37m):
- Reward holds ~0.5 average, no format collapse without KL
- Peak rewards reach 0.93
- Correct answers on COUNT, AVG, GROUP BY, multi-table JOINs, subqueries
- Medium questions harder — more column-name errors, alias confusion
- Final reward: 0.64
- Persistent issues:
- Error loop: model repeats same failing query without changing it (step 140: "no such column: bonus" 7 times)
- Table alias confusion: `T2.column` when column is on T1
- Missing DISTINCT in COUNT queries
- Post-episode repetition: 1-3 extra calls after correct answer
- Empty `<think>` blocks — model not reasoning about errors
Changes for Run 7
Applied after Run 6 analysis:
11. Allow CTE (WITH) queries
Fix: Changed SQL validation from first_keyword != "SELECT" to first_keyword not in ("SELECT", "WITH").
Files: server/sql_environment.py (both _execute_gold_sql and _execute_sql)
12. Increase post-episode penalty
Fix: _POST_EPISODE_PENALTY from -0.1 to -0.3. The -0.1 penalty wasn't strong enough — model still made 3-5 extra calls after episode end.
File: training/trl_adapter.py
13. HF Hub suffix for model versioning
Fix: Added HF_SUFFIX parameter to save cell. Set to e.g. "-v2" or "-cte" to push to hjerpe/sqlenv-qwen3-1.7b-grpo-v2.
File: notebooks/train_grpo.ipynb cell 9
Run 7 (repeat penalty + configure fix): Stable reward, multi-table weakness exposed
- Date: 2026-04-05
- Changes: F015 error-repetition penalty (`_REPEAT_PENALTY = -0.2`, 3-call deque window), removed public `configure()` that TRL misidentified as a tool
- Branch: `feat/error-repetition-penalty`
- SFT: 120 multi-turn trajectories, 2 epochs, loss 2.2→0.06, assistant-only loss enabled. 14% assistant tokens. Post-SFT format check: all 3 samples produce correct `<tool_call>` JSON with `describe` as first move.
- Phase 1 (435 easy, beta=0.04, 512 tokens, ~2h):
- Reward: −0.1 → 0.7 peak, stabilizing 0.3-0.7. Loss spike at step 320 (1.8) recovered.
- Model learned: describe→query→answer, comma-separated lists, pipe-delimited rows, `[]` for empty results, `UNION` queries, `NOT IN` subqueries, `LIKE '%North%'`.
- Repeat penalty observable: step 100 reward −0.22 (model re-described same table), step 120 reward −0.24 with repeat penalty stacking.
- Error recovery improved: after SQL error, model calls `describe` on the failing table then retries with correct column names (steps 110, 140).
- Persistent: hallucinated column names from pretraining (`T_full_name`), `ORDER BY count(*) DESC` without `GROUP BY`, CTE queries still rejected.
- Phase 2 (467 easy+medium, beta=0.0, 1024 tokens, ~2h22m):
- Reward oscillated 0.0–1.15, no clear upward trend vs Phase 1. Mean reward ~0.5.
- Single-table questions consistently correct (count, filter, aggregate, WHERE + GROUP BY HAVING).
- Multi-table JOIN weakness: can't follow FK chains (Documents→Templates→Ref_Template_Types), joins on wrong keys, hallucinates join columns.
- Repeat penalty firing on multi-table failures: step 150 reward −0.58 (5+ repeated failed JOINs on `T2.Template_ID`).
- New behavior: model answers `[]` for genuinely empty results, learned "No results"→"[]" mapping.
- Step 80 (Phase 2): 1.15 reward, advantage +1.50 — model wrote `SELECT avg(weight), year FROM cars_data GROUP BY year` with 13-row correct answer in 2 tool calls. Peak efficiency.
- Final reward: 0.61.
- Persistent issues:
- Multi-table JOINs: model can't chain through intermediate tables (needs the question-to-FK-path reasoning that 1.7B lacks without thinking)
- Answer hallucination when query returns empty: submits "No data available" or "N/A" instead of trying different query
- `describe` repeat on already-described tables (penalty fires but model still does it)
- Step 430: hex-encoded query string (`0x45636365646965...`) — degenerate output near end of training
Run 8 (thinking mode): Thinking helps error recovery but introduces degenerate loop
- Date: 2026-04-06
- Changes: F012 `enable_thinking` config flag, `ENABLE_THINKING = True` in notebook, max_new_tokens 768 (Phase 1) / 1280 (Phase 2)
- Branch: `feat/enable-thinking-mode`
- SFT: Same 120 multi-turn trajectories as Run 7, but system prompt omits the `/no_think` prefix. SFT data itself has no `<think>` blocks (approach B: let GRPO discover thinking).
- Phase 1 (435 easy, beta=0.04, 768 tokens, ~4.5h):
- Loss 0.31→oscillating 0.05-0.40 throughout. No clear trend.
- Correct answers on ~50% of sampled steps (reward 1.15). Similar to Run 7 on easy questions.
- Thinking triggers on errors: Step 90 — after 2 SQL errors (`no such column: airport_code`), model opens `<think>`, reasons about the column name mismatch, then generates a correct `AirportCode` query. Step 180 — reasons about `course_title` vs `course_name` after an error, corrects to the right column.
- Empty think blocks for easy questions: Steps 20-80 all show `<think></think>` with no content — model skips thinking when confident. Good token efficiency.
- NEW failure mode: `<think>assistant` degenerate loop — ~10/43 sampled steps (23%) show `<think>assistant<think>assistant...` repeating until the token limit. Model fails to close `</think>` and enters a repetitive pattern. Steps 110, 140, 200, 260, 300, 340, 410, 420, 430 all exhibit this. Burns the entire token budget with no useful output.
- Multi-table JOINs with subqueries work (Step 30: `NOT IN` subquery, Step 80: UNION, Step 435: correlated subquery with HAVING).
- Final step 435: model writes a complex correlated subquery with `HAVING count(*) = (SELECT ... ORDER BY count(*) DESC LIMIT 1)` — correct answer "Martin".
- Phase 2 (467 easy+medium, beta=0.0, 1280 tokens, stopped at step 182/467 — likely OOM):
- Reward oscillated 0.1-0.85, averaging 0.45. Comparable to Run 7 Phase 2 (0.5).
- Step 10: Easy question solved in 3 tool calls (describe→query→answer). Reward 1.15.
- Step 90: Multi-table JOIN with `HAVING count(*) < 200` — correct, reward 1.15.
- Step 110: `NOT IN` subquery for stadiums without concerts — correct on first try.
- Step 140: Cross-table JOIN (evaluation + employee, `MAX(bonus)`) — correct.
- Step 150: Multi-table chain reasoning with thinking — corrected the `Document_Name→Template_ID` join path after 2 errors. Long `<think>` block with correct reasoning.
- Step 170: Double-year intersection query (`Stadium_ID IN ... 2014 AND Stadium_ID IN ... 2015`) — correct.
- Model checkpoint was NOT pushed to HF Hub before crash.
- Reward oscillated 0.1-0.85, averaging
- Persistent issues:
- `<think>assistant` degenerate loop (~23% of Phase 1 steps) — new failure mode unique to thinking mode
- Multi-table FK chain queries still fail on medium difficulty (same as Run 7)
- Phase 2 no better than Run 7's Phase 2 — thinking mode doesn't help with the fundamental JOIN reasoning gap
Run 9 (v2 continued training, no-think): Confirms Phase 2 ceiling
- Date: 2026-04-11
- Changes: Resumed from v1 checkpoint (Run 7's final weights), 2 epochs Phase 1 + 2 epochs Phase 2. Fixed model preset lookup (`_get_preset()` matching on "1.7b" in the name string instead of an exact `.get()`).
- Branch: `feat/f011-3-way-comparison-notebook`
- Phase 1 (435 easy, beta=0.04, 512 tokens, ~3h34m, 870 steps):
- Loss: oscillates 0.01-0.13, occasional negatives (-0.05) in second half. More negative values than v1 Phase 1 — expected since starting from trained checkpoint, less to learn.
- Rewards: sawtooth 0.01-1.15. Easy questions solved reliably (describe→query→answer in 3 calls). Medium questions from mixed batches still fail.
- Model behavior: solid tool-call format, comma-separated lists, pipe-delimited rows. No format collapse.
- Step 300: Degenerate SQL — `ORDER BY HorsepowerDESC` (missing space), repeated 3 times. Token budget consumed.
- Step 560: Degenerate completion — output "icher Consulting Solution" (truncated gibberish). Reward 0.00. One-off.
- Phase 2 (467 easy+medium, beta=0.0, 1024 tokens, ~3h50m, 934 steps):
- Loss: oscillates -0.13 to +0.12, trend more negative than Phase 1 — policy sharpens on known patterns without KL regularization.
- Rewards: same sawtooth 0.01-1.15 as Phase 1, no upward trend. Mean ~0.5.
- Successes (medium): Step 140 — JOINed evaluation→employee for MAX(bonus), found "Louis Deacon" (1.13 reward). Step 750 — subquery `COUNT(*) > (SELECT ... ORDER BY Horsepower DESC LIMIT 1)`, answered "39" correctly.
- Failures (medium): Step 20 — hallucinated `make_id`, `full_name` columns, budget exhausted after 8+ tool calls. Step 50 — invented `Course_Attendance` table, cascading errors. Step 530 — tried `Bred`, `Breed` before finding `Breeds`, then queried the wrong column.
- Persistent pattern: Model describes tables correctly but writes SQL with wrong column names from pretraining knowledge (e.g., `full_name` instead of `FullName`, `country.name` when the table is `singer` with a `Country` column).
- Final reward: 0.048 (last step was incorrect)
- Charts: Reward Trend (Phase 1→2) shows flat continuation — no improvement from adding medium questions. Loss in Phase 2 oscillates around 0, with spikes to -0.13 (GRPO reinforcing already-known easy patterns).
- Conclusion: v2 confirms v1 findings. The 0.6B model's accuracy ceiling is set by pretraining SQL knowledge, not RL training budget. More epochs don't help medium questions. Next interventions: (1) more SFT on multi-table JOINs with correct column names, (2) larger model (1.7B), or (3) increase step budget to let model iterate.
Eval Format Fix (F011 comparison notebook)
- Date: 2026-04-10
- Problem: `compare_methods.ipynb` eval fed models a different message format than TRL training:
  - Tool results posted as `role: "user"` — training uses `role: "tool"` (Qwen3 renders as a `<tool_response>` wrapper)
  - Assistant turns stored as raw text content — training uses structured `tool_calls` dicts with JSON-string arguments
  - Question + table hint separated by `\n\n` — TRL appends the `reset()` return directly to the user message (no separator)
- Discovery method: Added a debug cell to render prompts via `apply_chat_template` and compared side-by-side with TRL training log output. The `role: "tool"` format renders as `<|im_start|>user\n<tool_response>...</tool_response>` while `role: "user"` renders as `<|im_start|>user\nplain text` — structurally different despite both appearing under the `user` token.
- Fix: Changed `LLMToolCallingPolicy` in compare_methods.ipynb to match TRL exactly: structured `tool_calls`, `role: "tool"`, concatenated user message. Also parse ALL `<tool_call>` blocks per generation and buffer extras (matches TRL's `_tool_call_loop`).
- Result (N=50, base=Qwen3-0.6B, 2026-04-11, with parse-failure retry, 2 runs):
- Run A:
- zero-shot: 0% accuracy, 28% parse rate, avg 10.8 steps (31/50 budget exhaust)
- 1-shot: 0% accuracy, 16% parse rate, avg 14.8 steps (49/50 budget exhaust)
- 3-shot: 0% accuracy, 20% parse rate, avg 13.8 steps (44/50 budget exhaust)
- grpo-v1: 28% accuracy, 95% parse rate, avg 4.0 steps, avg reward 0.355
- grpo-v2: 32% accuracy, 87% parse rate, avg 3.7 steps, avg reward 0.400
- Run B (same day, different Colab session):
- zero-shot: 0% accuracy, 24% parse rate, avg 12.4 steps (38/50 budget exhaust)
- 1-shot: 2% accuracy, 17% parse rate, avg 14.0 steps (46/50 budget exhaust)
- 3-shot: 0% accuracy, 19% parse rate, avg 14.8 steps (49/50 budget exhaust)
- grpo-v1: 30% accuracy, 100% parse rate, avg 3.5 steps, avg reward 0.386
- grpo-v2: 24% accuracy, 95% parse rate, avg 3.6 steps, avg reward 0.321
- Run-to-run variation: v1 scored 28% then 30%, v2 scored 32% then 24%. The 6-8pp swing confirms v1 and v2 are statistically indistinguishable at N=50. Report as "30% accuracy" for both.
- Parse failure retry: base models no longer die on the first parse failure — they get a no-op DESCRIBE and continue. This reveals they waste their entire 15-step budget repeating the same malformed output.
- Base model failure mode: can't produce `<tool_call>` format (76-83% parse failure rate). GRPO failure mode: produces valid tool calls but writes wrong SQL.
- 1-shot scored 2% in Run B (1 lucky episode) — demonstrates the N=50 noise floor for rare events.
- Checkpoint naming: `grpono-no-thinking` was caused by `HF_SUFFIX="no-no-thinking"` (missing leading dash) and a subsequent HF UI rename. See "Discovered Issues" section.
- TRL format verified from source: the `reset()` return is appended to the last user message (TRL docs + grpo_trainer.py). Tool results use `{"role": "tool", "name": name, "content": result}`. Generation runs to EOS (no stop at `</tool_call>`), all parsed tool calls executed in sequence.
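Putting the verified format together, one episode turn looks roughly like this (an illustrative sketch; the tool-call schema fields follow the standard HF/OpenAI layout, and the content values are invented):

```python
import json

question = "How many singers are there?"
table_hint = "table: singer"

messages = [
    # reset() return is appended to the user message with no separator
    {"role": "user", "content": question + table_hint},
    # assistant turns carry structured tool_calls with JSON-string arguments
    {"role": "assistant", "tool_calls": [{
        "type": "function",
        "function": {"name": "describe",
                     "arguments": json.dumps({"table_name": "singer"})},
    }]},
    # environment results come back as role "tool", not role "user"
    {"role": "tool", "name": "describe",
     "content": "Singer_ID INTEGER | Name TEXT | Country TEXT"},
]

print(messages[0]["content"])  # → How many singers are there?table: singer
```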
Current Status (after Run 9)
Working:
- Multi-turn SFT + `assistant_only_loss` — still the critical foundation
- GRPO learns on easy questions: reward −0.1→0.7 in Phase 1 (both Run 7 and 8)
- Repeat penalty (F015) fires correctly on exact-match repeated calls
- Error recovery: describe→retry after SQL error is a learned behavior
- Answer format: single values, comma-separated lists, pipe-delimited rows, `[]` for empty
- Thinking mode triggers on errors — model reasons about column name mismatches and table structure after SQL errors (Steps 90, 150, 180, 220, 280 in Run 8)
- Empty think blocks for easy questions — model doesn't waste tokens thinking when confident
Not yet working:
- Multi-table FK chain queries (medium difficulty) — confirmed across Runs 7, 8, 9. More RL epochs don't help.
- Phase 2 shows no improvement over Phase 1 — medium questions need more SFT coverage on JOIN patterns
- Column name hallucination from pretraining — model reads schema correctly then writes pretrained column names
- Model doesn't use the `sample` tool (learned in Run 6 but lost?)
- `<think>assistant` degenerate loop — thinking mode (Run 8) introduces a ~23% failure rate from unclosed think tags
For comparison notebook (F011):
- v1 checkpoint on HF Hub: `hjerpe/sqlenv-qwen3-0.6b-grpo`
- v2 checkpoint on HF Hub: `hjerpe/sqlenv-qwen3-0.6b-grpo-v2`
- Run 8 (thinking) checkpoint was NOT pushed — Colab session crashed before save
- N=50 eval completed 2026-04-11 (2 runs): v1 ~28-30%, v2 ~24-32%, confirming both are ~30% and within run-to-run noise
- v1 and v2 are statistically indistinguishable — the difference between runs is larger than the difference between checkpoints
- Thinking mode comparison can be added later when a checkpoint is available
Possible next interventions:
- Thinking mode training (0.6B): Resume from v1 with `ENABLE_THINKING=True`, push with a `-think` suffix. Run 8 showed thinking helps error recovery but crashed before save.
- More SFT on multi-table JOINs: Add trajectories with 3+ table chains, correct column names after describe. Highest priority — v2 proved more RL epochs don't help without this.
- Increase model size: Switch from 0.6B to 1.7B. Larger model may override pretrained column name biases from schema context.
OOM prevention for next thinking-mode run:
The Run 8 Phase 2 crash at step 182/467 was likely OOM. Root causes and mitigations:
- `max_new_tokens=1280` is too high for L4 with thinking — medium questions trigger long `<think>` blocks (Step 50 reasoning about `>1` vs `>=1`, Step 120 about breed/size format, Step 130 about a `T1.distinct_city` column mismatch). Reduce to 1024 for Phase 2.
- `num_generations=4` compounds the problem — each generation runs inference independently, so 4 rollouts × 1280 tokens = 5120 tokens of peak generation memory. Reduce to 3 generations for thinking-mode Phase 2. The `generation_batch_size` must also be updated to match.
- The `<think>assistant` degenerate loop inflates effective token usage — a rollout that enters the loop consumes the full `max_new_tokens` budget producing garbage. Fixing this loop via SFT (adding 5-10 examples with proper `<think>reasoning</think>` blocks) would reduce average token consumption significantly, making OOM less likely even at higher token limits.
- Phase 2 has no KL reference model (beta=0) — so memory is only model + generation buffers. The OOM is purely from generation length, not model copies.
Recommended config for next thinking-mode run (Phase 2):
```python
config2 = replace(config,
    beta=0.0,
    max_new_tokens=1024,  # was 1280
    num_generations=3,    # was 4
    enable_thinking=True,
)
```

Also set `generation_batch_size=3` in notebook_pipeline.py (it must equal num_generations).
Historical: Status after Run 6
Architecture decisions to preserve:
- Multi-turn SFT with `assistant_only_loss` — critical over per-turn
- Qwen3 template patch (`{% generation %}` tags) for SFT, restore original before GRPO
- SFT args as JSON strings (not dicts) — critical for Qwen3
- Phase 1 (easy, KL) → Phase 2 (easy+medium, no KL)
- DB opened with `mode=ro` — safety enforced by SQLite, not regex
File Map
| File | What changed |
|---|---|
| scripts/generate_sft_data.py | Multi-turn trajectories, JSON string args, answer formatting |
| scripts/inspect_sft_data.py | SFT data stats + tokenizer-rendered inspection |
| training/trl_adapter.py | Post-episode penalty (-0.3), error surfacing, _result_or_error |
| training/config.py | Added beta field (KL penalty) |
| training/notebook_pipeline.py | generation_batch_size, beta passthrough |
| server/verifier.py | _strip_answer_wrapping preprocessing |
| server/sql_environment.py | SQL validation allows SELECT and WITH |
| notebooks/train_grpo.ipynb | Multi-turn SFT, assistant_only_loss, template patch/restore, HF_SUFFIX |
Key Learnings
- Qwen3's apply_chat_template expands dict args — always use JSON strings for SFT tool_call arguments.
- Multi-turn SFT is critical for agentic GRPO — per-turn examples teach one action; the model never learns the full workflow. Full-trajectory SFT with `assistant_only_loss` teaches describe→query→answer as a coherent strategy.
- Qwen3 template lacks {% generation %} tags — patch before SFT for `assistant_only_loss`, restore before GRPO. TRL's `add_response_schema()` and `get_training_chat_template()` do exact string equality on the template.
- Don't show competing formats to small models — arrow-notation few-shot examples confused the model when it needed to produce `<tool_call>` JSON.
- KL penalty effectiveness depends on SFT quality — beta=0.04 was "too high" only because the SFT policy was single-turn. With multi-turn SFT, the same beta works fine.
- Reference model doubles memory — plan for this when using KL penalty on L4.
- Let the SQL engine enforce safety, not regex — the hard-coded SELECT-only prefix check blocks valid read-only SQL (CTEs). The DB is already `mode=ro`.
- Render training data through the actual tokenizer — inspect scripts that reformat JSON are fragile. The ground truth is `apply_chat_template` output from the same tokenizer instance used for training.
- Error loops are a 1.7B capacity limit — the model repeats failing queries verbatim because `<think>` is suppressed and it can't reason about the error. Enabling thinking mode may help.
- Repeat penalty works but doesn't fix root cause — the −0.2 penalty fires correctly on exact-match repeated tool calls, but the model's real problem is pretrained column-name hallucination, not repetition per se. The model varies its queries enough to avoid exact repeats while still failing on the same conceptual error.
- Phase 2 (medium) doesn't improve over Phase 1 (easy) — reward plateau at ~0.5 suggests the model needs more SFT coverage on multi-table JOINs, not just more GRPO steps. RL can't teach FK chain reasoning that isn't in the initial policy.
- Thinking mode helps error recovery but doesn't improve overall accuracy — the model uses `<think>` blocks to reason about SQL errors (column name mismatches, table structure), leading to correct retries. But accuracy on easy questions is similar to no-think Run 7. The benefit is qualitative (better error recovery), not quantitative (higher reward).
- `<think>assistant` degenerate loop is a new failure mode — ~23% of thinking-mode steps degenerate into `<think>assistant<think>assistant...` repeating until the token limit. The model fails to produce `</think>` and enters a repetitive pattern. This is the thinking-mode equivalent of Run 7's post-episode repetition. Fix: add SFT examples with proper `<think>reasoning</think>` blocks.
- Empty `<think></think>` blocks are good — the model learns to skip thinking on easy questions, preserving tokens for tool calls. This is emergent behavior from the GRPO reward signal (thinking wastes tokens → lower reward on easy questions).
- 1280 max_new_tokens is too aggressive for thinking mode on L4 — Phase 2 crashed at step 182/467, likely OOM. The longer `<think>` blocks in Phase 2 (medium questions trigger more reasoning) push memory past L4's 24GB. Use 1024 max_new_tokens for thinking-mode Phase 2.
- Public methods on environment_factory become TRL tools — TRL introspects all public methods for JSON schema generation. The `configure()` classmethod caused a `DocstringParsingException`. Keep configuration methods private (`_configure`).
- Continued training from checkpoint doesn't unlock medium questions — v2 ran 2 more epochs of Phase 1 + Phase 2 from v1's final checkpoint. Reward stayed flat at ~0.5 mean. The model reliably solves easy single-table queries but can't learn multi-table FK chain reasoning from RL alone. The policy needs SFT coverage on the patterns it can't discover through trial-and-error.
- Column name hallucination is the dominant error mode — the model describes tables correctly (seeing `FullName: TEXT`) then writes `SELECT full_name` or `SELECT Maker, FullName FROM car_makers ORDER BY MakerDESC LIMIT 1` (missing space). This is pretrained SQLese overriding the schema information the model just read. A 0.6B model can't override pretraining biases through the RL reward signal alone.
- Eval must exactly match TRL's message format — `role: "tool"` for env results (not `role: "user"`), structured `tool_calls` dicts for assistant turns (not raw `<tool_call>` text in content), question+table_hint concatenated without separator (TRL appends the `reset()` return to the last user message). Qwen3 renders `role: "tool"` as `<|im_start|>user\n<tool_response>...</tool_response>` — looks like a user message but is structurally different. Getting this wrong caused 0% accuracy across all conditions; fixing it recovered 10-50% on the base model.
- Incorrect answer reward of 0.0 creates an avoid-answering incentive — exploration steps accumulate 0.05-0.15 reward. Calling `answer(wrong)` gives 0.0 and ends the episode, so total reward (0.05) can be lower than not answering and exploring until budget (~0.10). The model may learn to write prose instead of calling `answer()` when uncertain. PRS (Progressive Reward Shaping, arxiv 2512.07478) addresses this with a small format-compliance reward for completing the tool pipeline regardless of correctness.
- Continued training trades guessing for abstention — v2 outputs "Task complete." instead of calling `answer()` on hard questions — a form of calibrated uncertainty. v1 guesses more but gets fewer right per attempt. The 0.0 incorrect-answer reward (learning #21) drives this: v2 internalized that guessing wrong is worse than not answering.
- v1 and v2 are statistically indistinguishable at N=50 — across two runs, v1 scored 28% then 30%, v2 scored 32% then 24%. The 6-8pp run-to-run variation exceeds the checkpoint difference. v2's abstention behavior (learning #22) adds variance: on borderline questions, whether v2 guesses or outputs "Task complete." varies by run. For reporting, use "30% accuracy" for both checkpoints. N=200+ would be needed to detect a real 4pp difference with 80% power.