---
title: GRPO Training Session Log
description: >-
  Chronological log of GRPO training runs on Qwen3-0.6B/1.7B covering nine runs,
  fixes applied, multi-turn SFT breakthrough, and capacity ceiling analysis
doc_type: exploration
---

GRPO Training Session Log

Context

Training Qwen3-1.7B as a SQL agent using SFT warmup + GRPO with TRL's environment_factory on Spider dataset. Running on Colab L4 (24GB).

Started 2026-04-02. Multi-turn SFT breakthrough on 2026-04-03.

Key Findings & Fixes Applied

1. SFT Null-Param Injection (ROOT CAUSE of first collapse)

Problem: Qwen3's apply_chat_template expands dict arguments to include ALL parameter names from ALL tools with null values. SFT trained model to always generate {"sql": null, "table_name": "X", "value": null}. Fix: Pass arguments as JSON strings (json.dumps({"table_name": table})) instead of dicts. Tokenizer uses strings verbatim.
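A minimal sketch of the difference (the tool names mirror this project's environment; the rendered-output comment reflects the behavior described above):

```python
import json

table = "singer"

# Dict arguments: Qwen3's chat template expands them to include every
# parameter name from every registered tool, injecting nulls for the
# missing ones when rendering the <tool_call> block.
bad_call = {"name": "describe", "arguments": {"table_name": table}}

# JSON-string arguments: the template emits the string verbatim,
# so only the parameters you actually pass appear in the rendering.
good_call = {"name": "describe", "arguments": json.dumps({"table_name": table})}

print(good_call["arguments"])  # {"table_name": "singer"}
```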

2. SFT Answer Formatting

Problem: Gold answers were Python literals (['a', 'b'], [[1, 'amc']]). Model learned wrong format. Fix: _format_answer_for_model() converts to human-readable: comma-separated lists, pipe-separated table rows.
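A sketch of the conversion, assuming the shapes described above (the real _format_answer_for_model may handle more cases):

```python
def format_answer_for_model(answer):
    """Convert Python-literal gold answers into the human-readable format
    the model is trained to emit (hypothetical sketch)."""
    if isinstance(answer, list) and answer and isinstance(answer[0], list):
        # Table rows -> pipe-separated values, one row per line
        return "\n".join(" | ".join(str(v) for v in row) for row in answer)
    if isinstance(answer, list):
        # Flat list -> comma-separated values
        return ", ".join(str(v) for v in answer)
    return str(answer)

print(format_answer_for_model(["a", "b"]))    # a, b
print(format_answer_for_model([[1, "amc"]]))  # 1 | amc
```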

3. Empty Tool Responses

Problem: TRL adapter returned observation.result (empty on SQL errors), hiding errors from model. Fix: _result_or_error() falls back to observation.error so model sees "Error: SQL error: ...".
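A sketch of the fallback, with an assumed observation shape (the real observation object and field names may differ):

```python
from collections import namedtuple

# Minimal stand-in for the environment's observation (assumed shape).
Observation = namedtuple("Observation", ["result", "error"])

def result_or_error(obs):
    """Prefer the result, but surface the error when the result is empty,
    so the model sees why a call failed (sketch of _result_or_error)."""
    if obs.result:
        return obs.result
    if obs.error:
        return f"Error: {obs.error}"
    return obs.result
```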

4. Post-Episode Penalty

Problem: Model continues calling tools after answering, wasting steps with no signal. Fix: _POST_EPISODE_PENALTY = -0.1 applied in all 4 tool methods when self._done is True.

5. Answer Stripping

Problem: Model wraps answers in quotes, code fences, "Answer:" prefix. Fix: _strip_answer_wrapping() in verifier preprocesses predicted answers.
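A sketch of the preprocessing, covering the three wrappers named above (the real _strip_answer_wrapping may handle more variants):

```python
import re

def strip_answer_wrapping(text):
    """Remove code fences, an 'Answer:' prefix, and wrapping quotes from
    a predicted answer (hypothetical sketch of _strip_answer_wrapping)."""
    t = text.strip()
    t = re.sub(r"^```[a-zA-Z]*\n?|```$", "", t).strip()  # code fences
    t = re.sub(r"^[Aa]nswer:\s*", "", t)                 # Answer: prefix
    if len(t) >= 2 and t[0] == t[-1] and t[0] in "\"'":  # wrapping quotes
        t = t[1:-1]
    return t.strip()
```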

6. Per-Turn SFT → Multi-Turn SFT (ROOT CAUSE of Run 5 stall)

Problem: SFT generated one example per assistant turn (347 examples, ~50% describe calls). Model over-learned "call describe" and never practiced query→answer. During GRPO with KL penalty, model stayed anchored to this single-turn policy. Fix: Generate one full multi-turn example per question (100 examples, each containing describe→query→answer). Enable assistant_only_loss via Qwen3 template patch so loss is on assistant turns only. Key detail: Qwen3's chat template lacks {% generation %} tags required by TRL for assistant_only_loss. Patch the template before SFT, restore original before GRPO (TRL does exact-match template checks in add_response_schema() and get_training_chat_template()).
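The patch/restore dance can be sketched as a context manager. The patched template itself (the hand-edited Qwen3 template with {% generation %} tags) is not reproduced here, and the helper name is hypothetical:

```python
from contextlib import contextmanager

@contextmanager
def patched_chat_template(tokenizer, patched_template):
    """Temporarily swap in a chat template whose assistant turns are
    wrapped in {% generation %}...{% endgeneration %} so TRL's
    assistant_only_loss can mask the loss, then restore the original
    byte-for-byte (TRL compares templates with exact string equality)."""
    original = tokenizer.chat_template
    tokenizer.chat_template = patched_template
    try:
        yield tokenizer
    finally:
        tokenizer.chat_template = original
```

Running SFT inside the `with` block and GRPO outside it guarantees the exact-match checks in add_response_schema() and get_training_chat_template() see the original template.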

7. Removed Arrow-Notation Few-Shot Examples

Problem: System prompt contained few-shot examples using arrow notation (→ describe(table_name="X")) while the model must produce <tool_call>{"name":"describe","arguments":...}</tool_call> JSON. Two competing formats for a 1.7B model. Fix: Removed _FEW_SHOT_BLOCK from system prompt. The textual "Strategy" section is sufficient.

8. KL Penalty + Curriculum

Problem: GRPO drifted policy away from SFT, causing <tool_response> instead of <tool_call>. Fix: beta=0.04 KL penalty + easy-first curriculum (phase 1: easy only, phase 2: easy+medium). With multi-turn SFT, beta=0.04 no longer blocks exploration.

9. OOM with Reference Model

Problem: beta>0 loads a reference model copy, doubling memory on L4. Fix: Reduced num_generations 6→4 and max_new_tokens 1024→512 for phase 1. Phase 2 drops beta to 0 (no reference model) and restores 1024 tokens.

10. generation_batch_size Divisibility

Problem: generation_batch_size (default 8) not divisible by num_generations (6). Fix: Set generation_batch_size=config.num_generations in notebook_pipeline.
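The constraint in two lines, with the numbers from this run:

```python
# GRPO sampling: generation_batch_size must be divisible by
# num_generations. The default of 8 is not divisible by 6, so the
# pipeline pins it to num_generations (the simplest safe choice).
num_generations = 6
generation_batch_size = num_generations  # was defaulting to 8

assert generation_batch_size % num_generations == 0
```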

Discovered Issues (not yet fixed)

CTE (WITH clause) rejected by environment

Problem: sql_environment.py SQL validation only allows queries starting with SELECT. The model discovers CTEs during GRPO (WITH dogs AS (...) SELECT ...), gets "Error: Only SELECT queries are allowed. Got: WITH", wastes a step recovering. Impact: Burns 1-2 steps on error recovery, reducing reward. Teaches model to avoid CTEs even though they're valid read-only SQL. Root cause: Hard-coded prefix check. The DB is already opened with mode=ro, so SQLite itself would reject writes. Fix: Allow WITH as a valid query prefix, or remove the prefix check entirely and rely on mode=ro.
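The proposed relaxed check can be sketched as follows (keyword extraction is simplified; the actual validation in sql_environment.py may differ):

```python
def validate_readonly_sql(sql):
    """Prefix check that also accepts CTEs. The DB is opened with
    mode=ro, so SQLite enforces read-only regardless of this check."""
    first_keyword = sql.lstrip().split(None, 1)[0].upper()
    if first_keyword not in ("SELECT", "WITH"):
        raise ValueError(f"Only SELECT queries are allowed. Got: {first_keyword}")
    return sql
```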

Post-episode repetition

Problem: Model keeps calling tools after the episode ends (gets {'error': 'Episode is over'}). The -0.1 penalty exists but the model still makes 3-5 extra calls. Possible fix: increase the penalty; alternatively, the model may learn to stop on its own as GRPO training progresses.

HF_SUFFIX naming bug (FIXED)

Problem: HF_SUFFIX is concatenated directly onto grpo without auto-prepending a dash. Setting HF_SUFFIX="no-no-thinking" produces sqlenv-qwen3-1.7b-grpono-no-thinking instead of the intended sqlenv-qwen3-1.7b-grpo-no-no-thinking. The grpono-no-thinking checkpoint on HF Hub was manually renamed via HF UI after push. Root cause: Format string f"sqlenv-{_model_short}-grpo{HF_SUFFIX}" expects the user to include a leading dash. Fix: Auto-prepend dash + strip existing prefixes from checkpoint names. When resuming from hjerpe/sqlenv-qwen3-0.6b-grpo, the old code produced sqlenv-sqlenv-qwen3-0.6b-grpo-grpo-v2 (double prefix). Now strips sqlenv- and -grpo* from _model_short before rebuilding the name. Files: notebooks/train_grpo.ipynb save cell.
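A sketch of the normalized name-building (the helper name is hypothetical; the fixed notebook cell may differ in detail):

```python
import re

def build_repo_name(model_short, suffix=""):
    """Normalize checkpoint names: strip an existing 'sqlenv-' prefix and
    '-grpo*' tail from the base, and auto-prepend a dash to the suffix."""
    base = re.sub(r"^sqlenv-", "", model_short)
    base = re.sub(r"-grpo.*$", "", base)
    if suffix and not suffix.startswith("-"):
        suffix = "-" + suffix
    return f"sqlenv-{base}-grpo{suffix}"

# Dash is auto-prepended; resuming from a pushed checkpoint no longer
# produces a double sqlenv-/grpo prefix.
print(build_repo_name("qwen3-1.7b", "no-no-thinking"))
print(build_repo_name("sqlenv-qwen3-0.6b-grpo", "v2"))
```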

Save cell uses Phase 1 config for output_dir

Problem: model.save_pretrained(config.output_dir) uses Phase 1's config, not Phase 2's config2. Both phases write to outputs/grpo_run — Phase 2 overwrites Phase 1 checkpoints in the same directory. Impact: Not a correctness bug (the final model weights are from Phase 2, which is correct), but fragile if you want to preserve Phase 1 checkpoint separately. Fix: Use config2.output_dir in the save cell, or save Phase 1 to a separate directory before Phase 2 starts.

Training Runs

Run 1 (pre-fixes): SFT OK, GRPO plateau at ~30-40% accuracy

  • Model learned tool-calling but rewards flat, advantage=0 most steps
  • Identified: no penalty for post-episode, answer format issues

Run 2 (batch 1 fixes): GRPO collapse — null args

  • SFT taught {"sql": null, "table_name": "X", "value": null}
  • Every rollout got TypeError → reward=0 → no gradient signal
  • Root cause: Qwen3 tokenizer expanding dict args

Run 3 (JSON string args fix): GRPO collapse — format drift

  • SFT clean, first ~30 steps showed correct tool calls
  • By step 40+: model output <tool_response> instead of <tool_call>
  • GRPO drifted structural tokens without KL penalty

Run 4 (KL penalty beta=0.04): OOM

  • Reference model doubled memory, exceeded L4 24GB

Run 5 (beta=0.04, reduced tokens/generations): KL too conservative

  • No collapse, correct format, but reward=0.00 everywhere
  • Model only generates single describe call per rollout
  • KL penalty keeps model too close to single-turn SFT policy
  • All 4 rollouts identical → advantage=0 → no learning

Run 6 (multi-turn SFT + assistant_only_loss): First successful training

  • Switched SFT from per-turn (347 examples) to multi-turn (100 full trajectories)
  • Enabled assistant_only_loss via Qwen3 template patch
  • Removed arrow-notation few-shot examples from system prompt
  • Phase 1 (435 easy, beta=0.04, 512 tokens, ~2h50m):
    • Clear upward reward trend: ~0.15 → 0.5-0.75
    • Loss trends upward 0→0.14, showing learning from reward signal
    • Model writes JOINs, GROUP BY HAVING, NOT IN subqueries, uses sample tool
    • Recovers from SQL errors (wrong column → retry, CTE rejected → plain JOIN)
    • CTE (WITH) queries rejected by environment — wasted steps
  • Phase 2 (467 easy+medium, beta=0, 1024 tokens, ~3h37m):
    • Reward holds ~0.5 average, no format collapse without KL
    • Peak rewards reach 0.93
    • Correct answers on COUNT, AVG, GROUP BY, multi-table JOINs, subqueries
    • Medium questions harder — more column-name errors, alias confusion
    • Final reward: 0.64
  • Persistent issues:
    • Error loop: model repeats same failing query without changing it (step 140: "no such column: bonus" 7 times)
    • Table alias confusion: T2.column when column is on T1
    • Missing DISTINCT in COUNT queries
    • Post-episode repetition: 1-3 extra calls after correct answer
    • Empty <think> blocks — model not reasoning about errors

Changes for Run 7

Applied after Run 6 analysis:

11. Allow CTE (WITH) queries

Fix: Changed SQL validation from first_keyword != "SELECT" to first_keyword not in ("SELECT", "WITH"). Files: server/sql_environment.py (both _execute_gold_sql and _execute_sql)

12. Increase post-episode penalty

Fix: _POST_EPISODE_PENALTY from -0.1 to -0.3. The -0.1 penalty wasn't strong enough — model still made 3-5 extra calls after episode end. File: training/trl_adapter.py

13. HF Hub suffix for model versioning

Fix: Added HF_SUFFIX parameter to save cell. Set to e.g. "-v2" or "-cte" to push to hjerpe/sqlenv-qwen3-1.7b-grpo-v2. File: notebooks/train_grpo.ipynb cell 9

Run 7 (repeat penalty + configure fix): Stable reward, multi-table weakness exposed

  • Date: 2026-04-05
  • Changes: F015 error-repetition penalty (_REPEAT_PENALTY = -0.2, 3-call deque window), removed public configure() that TRL misidentified as a tool
  • Branch: feat/error-repetition-penalty
  • SFT: 120 multi-turn trajectories, 2 epochs, loss 2.2→0.06, assistant-only loss enabled. 14% assistant tokens. Post-SFT format check: all 3 samples produce correct <tool_call> JSON with describe as first move.
  • Phase 1 (435 easy, beta=0.04, 512 tokens, ~2h):
    • Reward: −0.1 → 0.7 peak, stabilizing 0.3-0.7. Loss spike at step 320 (1.8) recovered.
    • Model learned: describe→query→answer, comma-separated lists, pipe-delimited rows, [] for empty results, UNION queries, NOT IN subqueries, LIKE '%North%'.
    • Repeat penalty observable: step 100 reward −0.22 (model re-described same table), step 120 reward −0.24 with repeat penalty stacking.
    • Error recovery improved: after SQL error, model calls describe on the failing table then retries with correct column names (steps 110, 140).
    • Persistent: hallucinated column names from pretraining (T_full_name), ORDER BY count(*) DESC without GROUP BY, CTE queries still rejected.
  • Phase 2 (467 easy+medium, beta=0.0, 1024 tokens, ~2h22m):
    • Reward oscillated 0.0–1.15, no clear upward trend vs Phase 1. Mean reward ~0.5.
    • Single-table questions consistently correct (count, filter, aggregate, WHERE + GROUP BY HAVING).
    • Multi-table JOIN weakness: can't follow FK chains (Documents→Templates→Ref_Template_Types), joins on wrong keys, hallucinates join columns.
    • Repeat penalty firing on multi-table failures: step 150 reward −0.58 (5+ repeated failed JOINs on T2.Template_ID).
    • New behavior: model answers [] for genuinely empty results, learned the "No results" → "[]" mapping.
    • Step 80 (Phase 2): 1.15 reward, advantage +1.50 — model wrote SELECT avg(weight), year FROM cars_data GROUP BY year with 13-row correct answer in 2 tool calls. Peak efficiency.
    • Final reward: 0.61.
  • Persistent issues:
    • Multi-table JOINs: model can't chain through intermediate tables (needs the question-to-FK-path reasoning that 1.7B lacks without thinking)
    • Answer hallucination when query returns empty: submits "No data available" or "N/A" instead of trying different query
    • describe repeat on already-described tables (penalty fires but model still does it)
    • Step 430: hex-encoded query string (0x45636365646965...) — degenerate output near end of training

Run 8 (thinking mode): Thinking helps error recovery but introduces degenerate loop

  • Date: 2026-04-06
  • Changes: F012 enable_thinking config flag, ENABLE_THINKING = True in notebook, max_new_tokens 768 (Phase 1) / 1280 (Phase 2)
  • Branch: feat/enable-thinking-mode
  • SFT: Same 120 multi-turn trajectories as Run 7, but system prompt omits /no_think prefix. SFT data itself has no <think> blocks (approach B: let GRPO discover thinking).
  • Phase 1 (435 easy, beta=0.04, 768 tokens, ~4.5h):
    • Loss 0.31→oscillating 0.05-0.40 throughout. No clear trend.
    • Correct answers on ~50% of sampled steps (reward 1.15). Similar to Run 7 on easy questions.
    • Thinking triggers on errors: Step 90 — after 2 SQL errors (no such column: airport_code), model opens <think>, reasons about column name mismatch, then generates correct AirportCode query. Step 180 — reasons about course_title vs course_name after error, corrects to right column.
    • Empty think blocks for easy questions: Steps 20-80 all show <think></think> with no content — model skips thinking when confident. Good token efficiency.
    • NEW failure mode: <think>assistant degenerate loop — ~10/43 sampled steps (23%) show <think>assistant<think>assistant... repeating until token limit. Model fails to close </think> and enters repetitive pattern. Steps 110, 140, 200, 260, 300, 340, 410, 420, 430 all exhibit this. Burns entire token budget with no useful output.
    • Multi-table JOINs with subqueries work (Step 30: NOT IN subquery, Step 80: UNION, Step 435: correlated subquery with HAVING).
    • Final step 435: model writes complex correlated subquery with HAVING count(*) = (SELECT ... ORDER BY count(*) DESC LIMIT 1) — correct answer "Martin".
  • Phase 2 (467 easy+medium, beta=0.0, 1280 tokens, stopped at step 182/467 — likely OOM):
    • Reward oscillated 0.1-0.85, averaging 0.45. Comparable to Run 7 Phase 2 (0.5).
    • Step 10: Easy question solved in 3 tool calls (describe→query→answer). Reward 1.15.
    • Step 90: Multi-table JOIN with HAVING count(*) < 200 — correct, reward 1.15.
    • Step 110: NOT IN subquery for stadiums without concerts — correct on first try.
    • Step 140: Cross-table JOIN (evaluation + employee, MAX(bonus)) — correct.
    • Step 150: Multi-table chain reasoning with thinking — corrected the Document_Name→Template_ID join path after 2 errors. Long <think> block with correct reasoning.
    • Step 170: Double-year intersection query (Stadium_ID IN ... 2014 AND Stadium_ID IN ... 2015) — correct.
    • Crashed at step 182 — likely OOM from 1280 max_new_tokens + thinking blocks consuming more memory during generation.
    • Model checkpoint was NOT pushed to HF Hub before crash.
  • Persistent issues:
    • <think>assistant degenerate loop (~23% of Phase 1 steps) — new failure mode unique to thinking mode
    • Multi-table FK chain queries still fail on medium difficulty (same as Run 7)
    • Phase 2 no better than Run 7's Phase 2 — thinking mode doesn't help with the fundamental JOIN reasoning gap

Run 9 (v2 continued training, no-think): Confirms Phase 2 ceiling

  • Date: 2026-04-11
  • Changes: Resumed from v1 checkpoint (Run 7's final weights), 2 epochs Phase 1 + 2 epochs Phase 2. Fixed model preset lookup (_get_preset() matching on "1.7b" in name string instead of exact .get()).
  • Branch: feat/f011-3-way-comparison-notebook
  • Phase 1 (435 easy, beta=0.04, 512 tokens, ~3h34m, 870 steps):
    • Loss: oscillates 0.01-0.13, occasional negatives (-0.05) in second half. More negative values than v1 Phase 1 — expected since starting from trained checkpoint, less to learn.
    • Rewards: sawtooth 0.01-1.15. Easy questions solved reliably (describe→query→answer in 3 calls). Medium questions from mixed batches still fail.
    • Model behavior: solid tool-call format, comma-separated lists, pipe-delimited rows. No format collapse.
    • Step 300: Degenerate SQL — ORDER BY HorsepowerDESC (missing space), repeated 3 times. Token budget consumed.
    • Step 560: Degenerate completion — output "icher Consulting Solution" (truncated gibberish). Reward 0.00. One-off.
  • Phase 2 (467 easy+medium, beta=0.0, 1024 tokens, ~3h50m, 934 steps):
    • Loss: oscillates -0.13 to +0.12, trend more negative than Phase 1 — policy sharpens on known patterns without KL regularization.
    • Rewards: same sawtooth 0.01-1.15 as Phase 1, no upward trend. Mean ~0.5.
    • Successes (medium): Step 140 — JOINed evaluation→employee for MAX(bonus), found "Louis Deacon" (1.13 reward). Step 750 — subquery COUNT(*) > (SELECT ... ORDER BY Horsepower DESC LIMIT 1), answered "39" correctly.
    • Failures (medium): Step 20 — hallucinated make_id, full_name columns, budget exhausted after 8+ tool calls. Step 50 — invented Course_Attendance table, cascading errors. Step 530 — tried Bred, Breed before finding Breeds, then queried wrong column.
    • Persistent pattern: Model describes tables correctly but writes SQL with wrong column names from pretraining knowledge (e.g., full_name instead of FullName, country.name when table is singer with Country column).
    • Final reward: 0.048 (last step was incorrect)
  • Charts: Reward Trend (Phase 1→2) shows flat continuation — no improvement from adding medium questions. Loss in Phase 2 oscillates around 0, with spikes to -0.13 (GRPO reinforcing already-known easy patterns).
  • Conclusion: v2 confirms v1 findings. The 0.6B model's accuracy ceiling is set by pretraining SQL knowledge, not RL training budget. More epochs don't help medium questions. Next interventions: (1) more SFT on multi-table JOINs with correct column names, (2) larger model (1.7B), or (3) increase step budget to let model iterate.

Eval Format Fix (F011 comparison notebook)

  • Date: 2026-04-10
  • Problem: compare_methods.ipynb eval fed models a different message format than TRL training:
    1. Tool results posted as role: "user" — training uses role: "tool" (Qwen3 renders as <tool_response> wrapper)
    2. Assistant turns stored as raw text content — training uses structured tool_calls dicts with JSON-string arguments
    3. Question + table hint separated by \n\n — TRL appends reset() return directly to user message (no separator)
  • Discovery method: Added debug cell to render prompts via apply_chat_template and compared side-by-side with TRL training log output. The role: "tool" format renders as <|im_start|>user\n<tool_response>...</tool_response> while role: "user" renders as <|im_start|>user\nplain text — structurally different despite both appearing under user token.
  • Fix: Changed LLMToolCallingPolicy in compare_methods.ipynb to match TRL exactly: structured tool_calls, role: "tool", concatenated user message. Also parse ALL <tool_call> blocks per generation and buffer extras (matches TRL's _tool_call_loop).
  • Result (N=50, base=Qwen3-0.6B, 2026-04-11, with parse-failure retry, 2 runs):
    • Run A:
      • zero-shot: 0% accuracy, 28% parse rate, avg 10.8 steps (31/50 budget exhaust)
      • 1-shot: 0% accuracy, 16% parse rate, avg 14.8 steps (49/50 budget exhaust)
      • 3-shot: 0% accuracy, 20% parse rate, avg 13.8 steps (44/50 budget exhaust)
      • grpo-v1: 28% accuracy, 95% parse rate, avg 4.0 steps, avg reward 0.355
      • grpo-v2: 32% accuracy, 87% parse rate, avg 3.7 steps, avg reward 0.400
    • Run B (same day, different Colab session):
      • zero-shot: 0% accuracy, 24% parse rate, avg 12.4 steps (38/50 budget exhaust)
      • 1-shot: 2% accuracy, 17% parse rate, avg 14.0 steps (46/50 budget exhaust)
      • 3-shot: 0% accuracy, 19% parse rate, avg 14.8 steps (49/50 budget exhaust)
      • grpo-v1: 30% accuracy, 100% parse rate, avg 3.5 steps, avg reward 0.386
      • grpo-v2: 24% accuracy, 95% parse rate, avg 3.6 steps, avg reward 0.321
    • Run-to-run variation: v1 scored 28% then 30%, v2 scored 32% then 24%. The 6-8pp swing confirms v1 and v2 are statistically indistinguishable at N=50. Report as "30% accuracy" for both.
    • Parse failure retry: base models no longer die on first parse failure — they get a no-op DESCRIBE and continue. This reveals they waste their entire 15-step budget repeating the same malformed output.
    • Base model failure mode: can't produce <tool_call> format (76-83% parse failure rate). GRPO failure mode: produces valid tool calls but writes wrong SQL.
    • 1-shot scored 2% in Run B (1 lucky episode) — demonstrates N=50 noise floor for rare events.
  • Checkpoint naming: grpono-no-thinking was caused by HF_SUFFIX="no-no-thinking" (missing leading dash) and subsequent HF UI rename. See "Discovered Issues" section.
  • TRL format verified from source: reset() return is appended to last user message (TRL docs + grpo_trainer.py). Tool results use {"role": "tool", "name": name, "content": result}. Generation runs to EOS (no stop at </tool_call>), all parsed tool calls executed in sequence.
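The run-to-run variation above is consistent with plain binomial noise. A quick check, using the log's numbers:

```python
import math

def accuracy_stderr(p, n):
    """Binomial standard error of an accuracy estimate from n episodes."""
    return math.sqrt(p * (1 - p) / n)

# At N=50 and ~30% true accuracy, one standard error is ~6.5pp,
# which matches the observed 6-8pp swing between v1 and v2 runs.
se = accuracy_stderr(0.30, 50)
print(f"{se:.3f}")  # ~0.065
```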

Current Status (after Run 9)

Working:

  • Multi-turn SFT + assistant_only_loss — still the critical foundation
  • GRPO learns on easy questions: reward −0.1→0.7 in Phase 1 (both Run 7 and 8)
  • Repeat penalty (F015) fires correctly on exact-match repeated calls
  • Error recovery: describe→retry after SQL error is a learned behavior
  • Answer format: single values, comma-separated lists, pipe-delimited rows, [] for empty
  • Thinking mode triggers on errors — model reasons about column name mismatches and table structure after SQL errors (Steps 90, 150, 180, 220, 280 in Run 8)
  • Empty think blocks for easy questions — model doesn't waste tokens thinking when confident

Not yet working:

  • Multi-table FK chain queries (medium difficulty) — confirmed across Runs 7, 8, 9. More RL epochs don't help.
  • Phase 2 shows no improvement over Phase 1 — medium questions need more SFT coverage on JOIN patterns
  • Column name hallucination from pretraining — model reads schema correctly then writes pretrained column names
  • Model doesn't use sample tool (learned in Run 6 but lost?)
  • <think>assistant degenerate loop — thinking mode (Run 8) introduces ~23% failure rate from unclosed think tags

For comparison notebook (F011):

  • v1 checkpoint on HF Hub: hjerpe/sqlenv-qwen3-0.6b-grpo
  • v2 checkpoint on HF Hub: hjerpe/sqlenv-qwen3-0.6b-grpo-v2
  • Run 8 (thinking) checkpoint was NOT pushed — Colab session crashed before save
  • N=50 eval completed 2026-04-11 (2 runs): v1 ~28-30%, v2 ~24-32%, confirming both are ~30% and within run-to-run noise
  • v1 and v2 are statistically indistinguishable — the difference between runs is larger than the difference between checkpoints
  • Thinking mode comparison can be added later when a checkpoint is available

Possible next interventions:

  • Thinking mode training (0.6B): Resume from v1 with ENABLE_THINKING=True, push as -think suffix. Run 8 showed thinking helps error recovery but crashed before save.
  • More SFT on multi-table JOINs: Add trajectories with 3+ table chains, correct column names after describe. Highest priority — v2 proved more RL epochs don't help without this.
  • Increase model size: Switch from 0.6B to 1.7B. Larger model may override pretrained column name biases from schema context.

OOM prevention for next thinking-mode run:

The Run 8 Phase 2 crash at step 182/467 was likely OOM. Root causes and mitigations:

  1. max_new_tokens=1280 is too high for L4 with thinking — medium questions trigger long <think> blocks (Step 50 reasoning about >1 vs >=1, Step 120 about breed/size format, Step 130 about T1.distinct_city column mismatch). Reduce to 1024 for Phase 2.
  2. num_generations=4 compounds the problem — each generation runs inference independently, so 4 rollouts × 1280 tokens = 5120 tokens of peak generation memory. Reduce to 3 generations for thinking-mode Phase 2. The generation_batch_size must also be updated to match.
  3. The <think>assistant degenerate loop inflates effective token usage — a rollout that enters the loop consumes the full max_new_tokens budget producing garbage. Fixing this loop via SFT (adding 5-10 examples with proper <think>reasoning</think> blocks) would reduce average token consumption significantly, making OOM less likely even at higher token limits.
  4. Phase 2 has no KL reference model (beta=0) — so memory is only model + generation buffers. The OOM is purely from generation length, not model copies.

Recommended config for next thinking-mode run (Phase 2):

```python
config2 = replace(config,
    beta=0.0,
    max_new_tokens=1024,      # was 1280
    num_generations=3,        # was 4
    enable_thinking=True,
)
```

Also set generation_batch_size=3 in notebook_pipeline.py (it must equal num_generations).

Historical: Status after Run 6

Architecture decisions to preserve:

  • Multi-turn SFT with assistant_only_loss — critical over per-turn
  • Qwen3 template patch ({% generation %} tags) for SFT, restore original before GRPO
  • SFT args as JSON strings (not dicts) — critical for Qwen3
  • Phase 1 (easy, KL) → Phase 2 (easy+medium, no KL)
  • DB opened with mode=ro — safety enforced by SQLite, not regex

File Map

| File | What changed |
| --- | --- |
| scripts/generate_sft_data.py | Multi-turn trajectories, JSON string args, answer formatting |
| scripts/inspect_sft_data.py | SFT data stats + tokenizer-rendered inspection |
| training/trl_adapter.py | Post-episode penalty (-0.3), error surfacing, _result_or_error |
| training/config.py | Added beta field (KL penalty) |
| training/notebook_pipeline.py | generation_batch_size, beta passthrough |
| server/verifier.py | _strip_answer_wrapping preprocessing |
| server/sql_environment.py | SQL validation allows SELECT and WITH |
| notebooks/train_grpo.ipynb | Multi-turn SFT, assistant_only_loss, template patch/restore, HF_SUFFIX |

Key Learnings

  1. Qwen3's apply_chat_template expands dict args — always use JSON strings for SFT tool_call arguments.
  2. Multi-turn SFT is critical for agentic GRPO — per-turn examples teach one action; the model never learns the full workflow. Full trajectory SFT with assistant_only_loss teaches describe→query→answer as a coherent strategy.
  3. Qwen3 template lacks {% generation %} tags — patch before SFT for assistant_only_loss, restore before GRPO. TRL's add_response_schema() and get_training_chat_template() do exact string equality on the template.
  4. Don't show competing formats to small models — arrow-notation few-shot examples confused the model when it needed to produce <tool_call> JSON.
  5. KL penalty effectiveness depends on SFT quality — beta=0.04 was "too high" only because the SFT policy was single-turn. With multi-turn SFT, the same beta works fine.
  6. Reference model doubles memory — plan for this when using KL penalty on L4.
  7. Let the SQL engine enforce safety, not regex — hard-coded SELECT-only prefix check blocks valid read-only SQL (CTEs). The DB is already mode=ro.
  8. Render training data through the actual tokenizer — inspect scripts that reformat JSON are fragile. The ground truth is apply_chat_template output from the same tokenizer instance used for training.
  9. Error loops are a 1.7B capacity limit — the model repeats failing queries verbatim because <think> is suppressed and it can't reason about the error. Enabling thinking mode may help.
  10. Post-episode penalty of -0.1 is too weak — model still makes 3-5 extra calls. Increased to -0.3.
  11. Repeat penalty works but doesn't fix root cause — the −0.2 penalty fires correctly on exact-match repeated tool calls, but the model's real problem is pretrained column-name hallucination, not repetition per se. The model varies its queries enough to avoid exact repeats while still failing on the same conceptual error.
  12. Phase 2 (medium) doesn't improve over Phase 1 (easy) — reward plateau at ~0.5 suggests the model needs more SFT coverage on multi-table JOINs, not just more GRPO steps. RL can't teach FK chain reasoning that isn't in the initial policy.
  13. Thinking mode helps error recovery but doesn't improve overall accuracy — the model uses <think> blocks to reason about SQL errors (column name mismatches, table structure), leading to correct retries. But accuracy on easy questions is similar to no-think Run 7. The benefit is qualitative (better error recovery) not quantitative (higher reward).
  14. <think>assistant degenerate loop is a new failure mode — ~23% of thinking-mode steps degenerate into <think>assistant<think>assistant... repeating until token limit. The model fails to produce </think> and enters a repetitive pattern. This is the thinking-mode equivalent of Run 7's post-episode repetition. Fix: add SFT examples with proper <think>reasoning</think> blocks.
  15. Empty <think></think> blocks are good — the model learns to skip thinking on easy questions, preserving tokens for tool calls. This is emergent behavior from GRPO reward signal (thinking wastes tokens → lower reward on easy questions).
  16. 1280 max_new_tokens is too aggressive for thinking mode on L4 — Phase 2 crashed at step 182/467, likely OOM. The longer <think> blocks in Phase 2 (medium questions trigger more reasoning) push memory past L4's 24GB. Use 1024 max_new_tokens for thinking-mode Phase 2.
  17. Public methods on environment_factory become TRL tools — TRL introspects all public methods for JSON schema generation. The configure() classmethod caused a DocstringParsingException. Keep configuration methods private (_configure).
  18. Continued training from checkpoint doesn't unlock medium questions — v2 ran 2 more epochs of Phase 1 + Phase 2 from v1's final checkpoint. Reward stayed flat at ~0.5 mean. The model reliably solves easy single-table queries but can't learn multi-table FK chain reasoning from RL alone. The policy needs SFT coverage on the patterns it can't discover through trial-and-error.
  19. Column name hallucination is the dominant error mode — the model describes tables correctly (seeing FullName: TEXT) then writes SELECT full_name or SELECT Maker, FullName FROM car_makers ORDER BY MakerDESC LIMIT 1 (missing space). This is pretrained SQLese overriding the schema information the model just read. A 0.6B model can't override pretraining biases through RL reward signal alone.
  20. Eval must exactly match TRL's message format — role:"tool" for env results (not role:"user"), structured tool_calls dicts for assistant turns (not raw <tool_call> text in content), question+table_hint concatenated without separator (TRL appends reset() return to last user message). Qwen3 renders role:"tool" as <|im_start|>user\n<tool_response>...</tool_response> — looks like a user message but is structurally different. Getting this wrong caused 0% accuracy across all conditions; fixing it recovered 10-50% on base model.
  21. Incorrect answer reward of 0.0 creates an avoid-answering incentive — exploration steps accumulate 0.05-0.15 reward. Calling answer(wrong) gives 0.0 and ends the episode, so total reward (0.05) can be lower than not answering and exploring until budget (~0.10). The model may learn to write prose instead of calling answer() when uncertain. PRS (Progressive Reward Shaping, arxiv 2512.07478) addresses this with a small format-compliance reward for completing the tool pipeline regardless of correctness.
  22. Continued training trades guessing for abstention — v2 outputs "Task complete." instead of calling answer() on hard questions — a form of calibrated uncertainty. v1 guesses more but gets fewer right per attempt. The 0.0 incorrect-answer reward (learning #21) drives this: v2 internalized that guessing wrong is worse than not answering.
  23. v1 and v2 are statistically indistinguishable at N=50 — across two runs, v1 scored 28% then 30%, v2 scored 32% then 24%. The 6-8pp run-to-run variation exceeds the checkpoint difference. v2's abstention behavior (learning #22) adds variance: on borderline questions, whether v2 guesses or outputs "Task complete." varies by run. For reporting, use "30% accuracy" for both checkpoints. N=200+ would be needed to detect a real 4pp difference with 80% power.