"""scripts/train_grpo.py - GRPO RL phase (2026-04 spec rewrite).
Loads the SFT-warm-started LoRA adapter at
``checkpoints/sft_warmup/checkpoint-50`` on top of the 4-bit NF4 quantised
``unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit`` base, connects to the
OpenEnv server (local or remote via ``QUBIT_MEDIC_URL``), and runs TRL's
:class:`GRPOTrainer` for 1,500 steps with diversity-focused rollout
sampling (temperature=1.2, top_p=0.95, top_k=50, repetition_penalty=1.1)
and a weighted 5-component reward bounded to [0, 1].
Why diversity-focused sampling
------------------------------
The first GRPO attempt (temperature=0.7) collapsed inside 100 steps to a
constant ``X_ERRORS=[] Z_ERRORS=[]`` policy: every group of 4 generations
was byte-identical, so within-group reward variance was zero and the
GRPO advantage was exactly zero - no gradient. The new sampler defaults
broaden per-token entropy enough to keep within-group variance positive,
which is what GRPO needs to learn anything.
Major spec features wired up here
---------------------------------
* ``_diversity_preflight`` - 5 prompts x 8 completions at T=1.2; abort if
  fewer than 3 prompts hit >=3 unique completions, since such a model is
  too collapsed for GRPO to recover.
* Frozen 200-syndrome eval set seeded ``4284`` (matches SFT validation
seed). Cached to ``data/grpo_validation.jsonl`` so reruns and offline
inspection see the same prompts.
* Tier-1 training metrics (every 5 steps): total_reward_mean,
reward_std_within_group, completion_uniqueness, advantage_mean_abs,
kl_divergence, grad_norm, policy_loss, learning_rate.
* Tier-2 eval metrics (every 100 steps, greedy at T=0): logical
correction rate, pymatching beat rate, format compliance, exact-match
pymatching, hard-syndrome (>=2 errors) LCR, syndrome consistency,
avg_completion_length, output_diversity at T=1.0.
* Tier-3 (every eval): per-round logical-error rate at d=3 p=0.001 plus
log10 transform.
* Sample-completion table every 50 steps: 5 random eval prompts with all 4
  rollouts each, per-component rewards, and the parsed action.
* Anti-hacking: 30s per-episode timeout (server-side), reward bounds
enforced both pre-multiply and post-sum, mode-collapse inspection
every 100 steps that auto-raises temperature by 0.2 if >7 of the
last 10 prompts produced 4 byte-identical generations.
* Wall-clock cap: 13h. Saves+exits cleanly if exceeded.
* Best-checkpoint tracking: writes ``output/best/`` whenever a new best
``eval/total_reward_mean`` is observed. Final state always saves to
``output/final/`` regardless of rank.
* Decision rules (warnings only, no auto-fix): step-50 reward variance
floor, step-500 pymatching-beat sanity, format-compliance floor, and
3-consecutive-log grad-norm runaway alarm.
Usage::
python -m scripts.train_grpo \
--sft-checkpoint checkpoints/sft_warmup/checkpoint-50 \
--output checkpoints/grpo \
--report-to wandb
"""
from __future__ import annotations
import argparse
import inspect
import json
import os
import random
# torch._dynamo recompile-limit guard. Unsloth's GRPO trainer wraps the
# loss/generation graph in torch.compile(fullgraph=True). Two things blow
# past Dynamo's default cache_size_limit (8) over a long GRPO run:
# 1. The mode-collapse hook mutates trainer.args.temperature in flight
# (e.g. 1.2 -> 1.4 -> 1.6 -> 1.8); each mutation re-specializes the
# compiled generation path.
# 2. Variable prompt/completion shapes specialize over hundreds of steps.
# When the limit is hit, fullgraph=True turns it into a fatal
# FailOnRecompileLimitHit (we lost a run at step 400/1500 to this). Set the
# limits high before torch is imported so they take effect everywhere.
os.environ.setdefault("TORCHDYNAMO_CACHE_SIZE_LIMIT", "256")
os.environ.setdefault("TORCHDYNAMO_RECOMPILE_LIMIT", "256")
import shutil
import sys
import threading
import time
from collections import deque
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable, Optional
# --------------------------------------------------------------------------- #
# Pre-flight: detect the unsloth / unsloth_zoo signature skew that crashes #
# GRPO at step 0 with a misleading TypeError. #
# --------------------------------------------------------------------------- #
#
# unsloth==2025.11.1's GRPO trainer template calls
# grpo_accumulated_loss(..., old_hidden_states=..., ref_hidden_states=...)
# but unsloth_zoo>=2026.x renamed those positional args to old_logps / ref_logps
# with no compat shim. Pip's resolver (with the unpinned `unsloth` line in
# requirements-train.txt) silently couples the two: it picks
# unsloth==2025.11.1 + unsloth_zoo==2026.4.9
# and that pair crashes at the first training step with:
# TypeError: grpo_accumulated_loss() missing 2 required positional
# arguments: 'old_logps' and 'ref_logps'
#
# SFT does not exercise this code path, so SFT finishes cleanly first, the
# checkpoint gets saved, and only then GRPO blows up - wasting the whole SFT
# run. This guard runs in well under a second, before any GPU work, and
# prints the exact pip command to fix it instead of the cryptic TypeError.
# --------------------------------------------------------------------------- #
def _assert_grpo_signature_compatible() -> None:
"""Abort early if the installed unsloth_zoo signature does not match
the call pattern baked into the installed unsloth.
"""
try:
import unsloth # noqa: F401 (force the patches to apply first)
import unsloth_zoo
from unsloth_zoo.rl_replacements import grpo_accumulated_loss
except Exception as exc:
print(f"[grpo-guard] WARNING: could not introspect unsloth_zoo "
f"({exc!r}); skipping signature check.", file=sys.stderr)
return
params = list(inspect.signature(grpo_accumulated_loss).parameters.keys())
has_hidden = "old_hidden_states" in params and "ref_hidden_states" in params
has_logps = "old_logps" in params and "ref_logps" in params
# The unsloth in this repo is pinned to the 2025.11.x lineage (matches
# what SFT just used). That lineage calls with old_hidden_states= /
# ref_hidden_states=. If unsloth_zoo has those names, we are fine.
if has_hidden:
return
unsloth_ver = getattr(unsloth, "__version__", "?")
zoo_ver = getattr(unsloth_zoo, "__version__", "?")
have_logps_only = has_logps and not has_hidden
msg = [
"",
"=" * 78,
"[grpo-guard] FATAL: unsloth / unsloth_zoo signature mismatch detected.",
"=" * 78,
f" unsloth == {unsloth_ver}",
f" unsloth_zoo == {zoo_ver}",
f" grpo_accumulated_loss parameters: {params}",
"",
" unsloth (this version) calls grpo_accumulated_loss with",
" old_hidden_states=... , ref_hidden_states=...",
" but the installed unsloth_zoo expects",
" old_logps=... , ref_logps=...",
" as required positional arguments." if have_logps_only else
" but the installed unsloth_zoo signature does not contain those names.",
"",
" Without this fix, GRPO will crash at step 0 with:",
" TypeError: grpo_accumulated_loss() missing 2 required positional",
" arguments: 'old_logps' and 'ref_logps'",
"",
" Fix on Colab (one-liner):",
" pip install --no-deps --force-reinstall unsloth_zoo==2025.11.1 \\",
" && rm -rf unsloth_compiled_cache",
"",
" Then re-run:",
" python -m scripts.train_grpo --sft-checkpoint "
"checkpoints/sft_warmup/checkpoint-50 \\",
" --output checkpoints/grpo",
"=" * 78,
"",
]
raise SystemExit("\n".join(msg))
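# Manual reproduction of the same check (illustrative sketch; assumes both
# packages import in the current environment, mirroring the guard above):
#
#     python -c "import inspect; import unsloth; \
#                from unsloth_zoo.rl_replacements import grpo_accumulated_loss; \
#                print(inspect.signature(grpo_accumulated_loss))"
#
# A printed signature showing old_logps / ref_logps instead of
# old_hidden_states / ref_hidden_states is exactly the mismatch the guard
# aborts on.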
def _wipe_stale_grpo_cache() -> None:
"""Remove unsloth_compiled_cache/UnslothGRPOTrainer.py if present.
The cache file is regenerated automatically by unsloth on the next
GRPO import using the *currently installed* unsloth_zoo source, so
deleting it is safe and is the only way to recover after fixing
a previously-mismatched install.
"""
cache_file = Path("unsloth_compiled_cache") / "UnslothGRPOTrainer.py"
if cache_file.exists():
print(f"[grpo-guard] removing stale {cache_file} so it regenerates "
f"against the current unsloth_zoo install")
try:
cache_file.unlink()
except OSError as exc:
print(f"[grpo-guard] WARNING: failed to remove {cache_file}: "
f"{exc!r}", file=sys.stderr)
# --------------------------------------------------------------------------- #
# Per-batch scoring cache + reward bounds enforcement #
# --------------------------------------------------------------------------- #
#
# The original implementation called the env 5 times per (prompt, completion)
# - once per reward function. We fix that with a single (prompt, completion)
# -> breakdown cache keyed inside one GRPO step, AND we apply the spec's
# weighted-sum + [0, 1] clip in one place so the gradient-carrying reward
# function returns a total that is already correctly weighted and bounded.
# --------------------------------------------------------------------------- #
@dataclass
class _ScoredCompletion:
"""One scored (prompt, completion) pair, keyed by the env episode."""
rewards: dict # raw per-component rewards from the env (in [0, 1])
weighted_total: float # weighted sum, clipped to [0, 1]
parse_success: bool
parse_partial: bool
x_pred: list
z_pred: list
actual_flip: int
pm_flip: int
elapsed: float
timed_out: bool
curriculum_level: str
bounds_violations: int # >0 if env returned a component outside [0, 1]
@dataclass
class _BatchScoringCache:
"""Caches per-(prompt, completion) scores within one GRPO step."""
env_client: object
reward_weights: dict
_cache: dict = field(default_factory=dict)
_step_keys: list = field(default_factory=list)
_lock: threading.Lock = field(default_factory=threading.Lock)
_all_curriculum_stats: dict = field(default_factory=dict)
_episodes: int = 0
_timeouts: int = 0
_bounds_violations: int = 0
def _enforce_bounds(self, name: str, val: float) -> tuple[float, bool]:
"""Clip a reward component to [0, 1]; flag if it was outside."""
v = float(val)
if v < 0.0 or v > 1.0:
return max(0.0, min(1.0, v)), True
return v, False
def score(self, prompt: str, completion: str) -> _ScoredCompletion:
key = (prompt, completion)
with self._lock:
entry = self._cache.get(key)
if entry is not None:
return entry
        # Env work is independent across (p, c), so the network round-trip
        # runs outside the lock.
obs = self.env_client.reset()
result = self.env_client.step(raw_response=completion,
episode_id=obs.episode_id)
info = result.info
action = info.get("parsed_action", {})
# Apply spec weights + [0, 1] bounds enforcement.
raw = info.get("rewards", {}) or {}
violations = 0
weighted_sum = 0.0
bounded_components: dict = {}
for name, weight in self.reward_weights.items():
v, was_oob = self._enforce_bounds(name, raw.get(name, 0.0))
bounded_components[name] = v
weighted_sum += weight * v
if was_oob:
violations += 1
# Clip weighted sum to [0, 1] (already in range when components
# are; defensive against weights that don't sum to 1.0).
weighted_total = max(0.0, min(1.0, weighted_sum))
# Preserve env's "total" alongside our weighted total so downstream
# wandb log_reward_breakdown still works.
bounded_components["total"] = weighted_total
scored = _ScoredCompletion(
rewards=bounded_components,
weighted_total=weighted_total,
parse_success=bool(action.get("parse_success", False)),
parse_partial=False,
x_pred=list(action.get("x_error_qubits", []) or []),
z_pred=list(action.get("z_error_qubits", []) or []),
actual_flip=int(info.get("actual_observable_flip", 0)),
pm_flip=int(info.get("pymatching_observable_pred", 0)),
elapsed=float(info.get("elapsed_seconds", 0.0)),
timed_out=bool(info.get("timed_out", False)),
curriculum_level=str(getattr(obs, "curriculum_level", "")),
bounds_violations=violations,
)
with self._lock:
self._cache[key] = scored
self._step_keys.append(key)
self._all_curriculum_stats = info.get("curriculum_stats", {}) or {}
self._episodes += 1
if scored.timed_out:
self._timeouts += 1
if violations:
self._bounds_violations += violations
return scored
def drain_step(self):
"""Pop everything cached since the last drain_step() call."""
with self._lock:
entries = [self._cache[k] for k in self._step_keys]
keys = list(self._step_keys)
self._step_keys.clear()
            # Bound memory use: long runs accumulate many unique
            # (prompt, completion) strings.
if len(self._cache) > 4096:
self._cache.clear()
return entries, keys
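# Worked example of the bounds + weighted-sum logic above (hypothetical
# weights; the real values come from qubit_medic.config.REWARD_WEIGHTS):
#
#     weights = {"logical_correction": 0.6, "format_compliance": 0.4}
#     raw     = {"logical_correction": 1.3, "format_compliance": 0.5}
#     -> logical_correction clips to 1.0 (counted as one bounds violation)
#     -> weighted_total = 0.6 * 1.0 + 0.4 * 0.5 = 0.8, already inside [0, 1]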
def _seed_everything(seed: int) -> None:
import numpy as np
import torch
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
# --------------------------------------------------------------------------- #
# Reward function factory #
# --------------------------------------------------------------------------- #
#
# Spec: total reward = sum of weighted components, clipped to [0, 1].
# Implementation: the cache returns a per-completion weighted_total in
# [0, 1]. We expose ONE TRL reward function that returns that bounded
# total, plus zero-weight per-component observers so wandb gets per-
# component traces without altering the policy gradient.
# --------------------------------------------------------------------------- #
_REWARD_COMPONENTS = (
"logical_correction",
"hamming_overlap",
"syndrome_consistency",
"format_compliance",
"pymatching_beat",
)
def _make_reward_fns(cache: _BatchScoringCache):
def total_fn(prompts, completions, **_unused):
scored = [cache.score(p, c) for p, c in zip(prompts, completions)]
return [s.weighted_total for s in scored]
total_fn.__name__ = "reward_total"
observers: list = []
for name in _REWARD_COMPONENTS:
def _factory(component_name: str):
def fn(prompts, completions, **_unused):
scored = [cache.score(p, c) for p, c in zip(prompts, completions)]
return [s.rewards.get(component_name, 0.0) for s in scored]
fn.__name__ = f"reward_obs_{component_name}"
return fn
observers.append(_factory(name))
return [total_fn] + observers
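# Sketch of the TRL reward-function contract these implement (values are
# illustrative): each function receives parallel lists and returns one float
# per completion, e.g.
#
#     total_fn(prompts=["p1", "p1"], completions=["c1", "c2"])  # -> [0.42, 0.87]
#
# Every call goes through _BatchScoringCache.score, so the zero-weight
# observers reuse the episode already scored for reward_total instead of
# hitting the env again.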
# --------------------------------------------------------------------------- #
# Frozen eval set: 200 syndromes seeded GRPO_VAL_SEED. #
# --------------------------------------------------------------------------- #
#
# We snapshot the 200 prompts to data/grpo_validation.jsonl on first run so
# reruns hit byte-identical syndromes, and so the file can be inspected /
# diffed offline. If the file already exists with >= n rows, we trust it.
# --------------------------------------------------------------------------- #
def _load_or_build_eval_set(env_client, *, seed: int, n: int, path: str) -> list[dict]:
p = Path(path)
if p.exists():
rows: list[dict] = []
with p.open("r") as f:
for line in f:
line = line.strip()
if not line:
continue
rows.append(json.loads(line))
if len(rows) >= n:
print(f"[grpo-eval] reusing cached eval set: {p} ({len(rows)} rows)")
return rows[:n]
print(f"[grpo-eval] cached eval set at {p} has {len(rows)} < {n} rows; "
f"regenerating")
p.parent.mkdir(parents=True, exist_ok=True)
rows = []
print(f"[grpo-eval] building frozen eval set seed={seed} n={n} -> {p}")
cur_seed = seed
for _ in range(n):
obs = env_client.reset(seed=cur_seed)
rows.append({
"prompt": obs.prompt,
"episode_id": int(obs.episode_id),
"curriculum_level": str(getattr(obs, "curriculum_level", "")),
"distance": int(getattr(obs, "distance", 0)),
"rounds": int(getattr(obs, "rounds", 0)),
"p": float(getattr(obs, "p", 0.0)),
"syndrome_bits": list(getattr(obs, "syndrome_bits", []) or []),
"seed": cur_seed,
})
cur_seed += 1 # deterministic, reproducible
with p.open("w") as f:
for r in rows:
f.write(json.dumps(r) + "\n")
print(f"[grpo-eval] wrote {len(rows)} eval rows to {p}")
return rows
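# Shape of one cached row in data/grpo_validation.jsonl (field names match the
# dict built above; the values here are purely illustrative):
#
#     {"prompt": "...", "episode_id": 17, "curriculum_level": "level_1",
#      "distance": 3, "rounds": 1, "p": 0.001,
#      "syndrome_bits": [0, 1, 0, 0, 1, 0, 0, 0], "seed": 4284}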
# --------------------------------------------------------------------------- #
# Diversity preflight #
# --------------------------------------------------------------------------- #
def _diversity_preflight(model, tokenizer, *, val_path: str, n_prompts: int = 5,
n_samples_per_prompt: int = 8, temperature: float = 1.2,
min_unique: int = 3, min_passing: int = 3,
max_new_tokens: int = 50) -> bool:
"""Generate ``n_samples_per_prompt`` completions per prompt at high temp.
    Returns True iff at least ``min_passing`` of the prompts produced
    >= ``min_unique`` unique completions (uniqueness judged on the decoded
    text with special tokens skipped). A False result means the model has
    collapsed past the point where GRPO can recover, so training should not
    be started.
"""
import torch
src = Path(val_path)
if not src.exists():
print(f"[grpo-preflight] WARNING: {val_path} not found; "
f"skipping diversity preflight")
return True # don't block on missing file
rows: list[dict] = []
with src.open("r") as f:
for line in f:
line = line.strip()
if not line:
continue
rows.append(json.loads(line))
if len(rows) < n_prompts:
print(f"[grpo-preflight] WARNING: only {len(rows)} validation rows "
f"available, need {n_prompts}; using all")
n_prompts = len(rows)
# Mix of trivial (no errors) and non-trivial (errors present), so the
# diversity probe sees both regimes the model has to handle.
rng = random.Random(0)
trivial = [r for r in rows if not r.get("had_errors")]
non_trivial = [r for r in rows if r.get("had_errors")]
rng.shuffle(trivial)
rng.shuffle(non_trivial)
half = max(1, n_prompts // 2)
chosen = (non_trivial[:half] + trivial[:n_prompts - half])[:n_prompts]
if not chosen:
chosen = rows[:n_prompts]
print(f"[grpo-preflight] probing diversity at T={temperature} on "
f"{len(chosen)} prompts x {n_samples_per_prompt} samples each")
try:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)
except Exception:
model.eval()
passing = 0
per_prompt_unique: list[int] = []
pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
for i, row in enumerate(chosen):
prompt = row["prompt"]
try:
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
chat, tokenize=False, add_generation_prompt=True,
)
except Exception:
text = ("<|im_start|>user\n" + prompt
+ "\n<|im_end|>\n<|im_start|>assistant\n")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
completions: list[str] = []
for _ in range(n_samples_per_prompt):
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=temperature,
top_p=0.95,
top_k=50,
repetition_penalty=1.1,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=pad_id,
)
gen_ids = out[0][inputs["input_ids"].shape[1]:]
txt = tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
completions.append(txt)
unique = len(set(completions))
per_prompt_unique.append(unique)
verdict = "PASS" if unique >= min_unique else "FAIL"
print(f"[grpo-preflight] prompt {i}: {unique}/{n_samples_per_prompt} "
f"unique [{verdict}] examples={completions[:2]!r}")
if unique >= min_unique:
passing += 1
overall = passing >= min_passing
print(f"[grpo-preflight] {passing}/{len(chosen)} prompts passed "
f"(threshold: >= {min_passing}). per_prompt_unique={per_prompt_unique}")
if not overall:
print("=" * 78)
print("[grpo-preflight] PRE-FLIGHT FAILED - model is too collapsed; "
"redo SFT with regularization before launching GRPO")
print("=" * 78)
return overall
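# Pass/fail arithmetic for the defaults above (5 prompts x 8 samples,
# min_unique=3, min_passing=3): per-prompt unique counts of [1, 4, 3, 1, 5]
# pass (three prompts reach >= 3 unique completions), while [1, 2, 3, 1, 2]
# fail (only one does).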
# --------------------------------------------------------------------------- #
# In-loop W&B callback (tier-1 + tier-2 + tier-3 + sample table + safeguards) #
# --------------------------------------------------------------------------- #
def _build_wandb_callback(cache, model, tokenizer, env_client, eval_rows,
*, sample_every: int, sample_n: int,
inloop_every: int,
inloop_max_new_tokens: int,
kl_alarm: float,
inspection_every: int, inspection_sample_n: int,
inspection_collapse_threshold: int,
temp_bump_on_collapse: float,
best_dir: Path, output_dir: Path,
wall_seconds: float,
decision_thresholds: dict):
from transformers import TrainerCallback
from qubit_medic import wandb_utils
if not wandb_utils.is_available():
return None
started_at = time.time()
# Rolling cache for the inspection hook: we record (group_unique_count)
# per prompt as it streams in, and at every inspection_every-step
# boundary look at the most recent inspection_sample_n entries.
recent_uniques = deque(maxlen=max(inspection_sample_n, 16))
grad_norm_run = deque(maxlen=decision_thresholds["grad_norm_run_len"])
state = {
"best_total_reward": float("-inf"),
"best_step": -1,
"wall_exceeded": False,
"step50_warned": False,
"step500_warned": False,
"format_warned_at": -1,
"grad_norm_warned_at": -1,
"beat_rate_history": [],
}
class _RolloutCallback(TrainerCallback):
# ------------------------------------------------------------------ #
# Per-step instrumentation #
# ------------------------------------------------------------------ #
def on_step_end(self, args, state_, control, **kwargs): # noqa: D401
entries, keys = cache.drain_step()
if not entries:
return
step = state_.global_step
# Group entries by prompt so we can compute within-group stats.
groups: list[list[_ScoredCompletion]] = []
current_prompt = None
current: list[_ScoredCompletion] = []
for (p, _), e in zip(keys, entries):
if p != current_prompt and current:
groups.append(current)
current = []
current_prompt = p
current.append(e)
if current:
groups.append(current)
# ----- Tier-1 metrics ----- #
totals = [e.weighted_total for e in entries]
if not totals:
return
mean_total = sum(totals) / len(totals)
within_stds: list[float] = []
uniques: list[int] = []
for grp in groups:
if len(grp) < 2:
within_stds.append(0.0)
uniques.append(1)
continue
vals = [e.weighted_total for e in grp]
mu = sum(vals) / len(vals)
var = sum((v - mu) ** 2 for v in vals) / len(vals)
within_stds.append(var ** 0.5)
key_set = {(tuple(e.x_pred), tuple(e.z_pred)) for e in grp}
uniques.append(len(key_set))
mean_within_std = sum(within_stds) / max(1, len(within_stds))
mean_unique = sum(uniques) / max(1, len(uniques))
# GRPO advantage (recomputed locally for the log only).
adv_abs: list[float] = []
for grp in groups:
if len(grp) < 2:
continue
vals = [e.weighted_total for e in grp]
mu = sum(vals) / len(vals)
var = sum((v - mu) ** 2 for v in vals) / len(vals)
std = max((var ** 0.5), 1e-4)
adv_abs.extend(abs((v - mu) / std) for v in vals)
mean_adv_abs = sum(adv_abs) / max(1, len(adv_abs))
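            # Worked example: group rewards [0.2, 0.2, 0.8, 0.8] give mu=0.5,
            # std=0.3 and |A|=1.0 for every member; a collapsed group like
            # [0.5, 0.5, 0.5, 0.5] has std clamped to 1e-4 and |A|=0.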
wandb_utils.log({
"train/total_reward_mean": mean_total,
"train/reward_std_within_group": mean_within_std,
"train/completion_uniqueness": mean_unique,
"train/advantage_mean_abs": mean_adv_abs,
"train/global_step": step,
}, step=step)
wandb_utils.log_reward_breakdown(
[e.rewards for e in entries], step=step, prefix="train",
)
wandb_utils.log({
"train/reward_bounds_violations_total": cache._bounds_violations,
"train/env_episodes_total": cache._episodes,
"train/env_timeouts_total": cache._timeouts,
}, step=step)
# ----- Decision rule: step 50 within-group variance ----- #
if (not state["step50_warned"]
and step >= decision_thresholds["reward_std_check_step"]):
if mean_within_std < decision_thresholds["reward_std_floor"]:
print(f"\n[grpo-decision] CRITICAL @ step {step}: "
f"train/reward_std_within_group={mean_within_std:.4f} "
f"< {decision_thresholds['reward_std_floor']}. The "
f"within-group reward std has collapsed; GRPO has "
f"effectively zero advantage signal. Pausing for "
f"manual review (warning only - no auto-action).")
wandb_utils.log({
"alarms/reward_std_collapse": 1.0,
"alarms/reward_std_value": mean_within_std,
}, step=step)
state["step50_warned"] = True
# Compliance Section 8 (audit, 2026-04): continuous warning
# for reward_std < 0.02 at ANY step, not only step 50. We
# throttle to once per 100 steps so the message doesn't
# spam every 5-step log line. The existing step-50 gate
# above stays as the harder "pause for review" check at
# the higher 0.03 threshold; this continuous one fires
# earlier the moment within-group variance crosses the
# spec floor and tells the operator to look at the run.
CONT_REWARD_STD_FLOOR = 0.02
if mean_within_std < CONT_REWARD_STD_FLOOR:
last_warn = state.get("reward_std_warned_at", -1)
if step - last_warn >= 100:
print(f"\n[grpo-warn] @ step {step}: "
f"train/reward_std_within_group="
f"{mean_within_std:.4f} < {CONT_REWARD_STD_FLOOR} "
f"(continuous alarm). GRPO advantage signal is "
f"vanishing - inspect generations / temperature.")
wandb_utils.log({
"alarms/reward_std_continuous_low": 1.0,
"alarms/reward_std_value": mean_within_std,
}, step=step)
state["reward_std_warned_at"] = step
# ----- Mode-collapse inspection hook ----- #
for u in uniques:
recent_uniques.append(u)
if (inspection_every and step > 0
and step % inspection_every == 0
and len(recent_uniques) >= inspection_sample_n):
last = list(recent_uniques)[-inspection_sample_n:]
collapsed_count = sum(1 for u in last if u == 1)
if collapsed_count > inspection_collapse_threshold:
cur_temp = float(getattr(args, "temperature", 1.2))
# Cap the bump at 2.0 - going higher does not actually
# produce more diversity (sampler is already at top-k=50
# / top-p=0.95) and every distinct value re-specializes
# the torch.compile cache, eventually tripping
# FailOnRecompileLimitHit even with raised limits.
new_temp = min(2.0, cur_temp + temp_bump_on_collapse)
if new_temp <= cur_temp + 1e-6:
print(f"\n[grpo-inspection] WARN @ step {step}: "
f"{collapsed_count}/{inspection_sample_n} prompts "
f"collapsed but temperature already at cap "
f"({cur_temp:.2f}); leaving unchanged.")
else:
print(f"\n[grpo-inspection] WARN @ step {step}: "
f"{collapsed_count}/{inspection_sample_n} of the "
f"most recent prompts had ALL 4 generations "
f"identical. Bumping rollout temperature "
f"{cur_temp:.2f} -> {new_temp:.2f}.")
try:
args.temperature = new_temp
except Exception as exc:
print(f"[grpo-inspection] could not patch temperature "
f"on TRL args: {exc!r}")
wandb_utils.log({
"alarms/mode_collapse_count": collapsed_count,
"train/temperature_after_bump": new_temp,
}, step=step)
# ----- Sample-completion table every sample_every steps ----- #
if sample_every and step > 0 and step % sample_every == 0:
rows_out = []
# First sample_n unique prompts in this batch; emit a row per
# generation (so the W&B table has gen_idx as a column).
chosen_groups: list[tuple[str, list[_ScoredCompletion]]] = []
seen_prompts: set = set()
for (p, _), e in zip(keys, entries):
if p in seen_prompts:
for q, grp in chosen_groups:
if q == p:
grp.append(e)
break
continue
if len(chosen_groups) >= sample_n:
continue
chosen_groups.append((p, [e]))
seen_prompts.add(p)
for prompt, grp in chosen_groups:
for gi, e in enumerate(grp[:4]):
rows_out.append({
"step": step,
"prompt": prompt[:600],
"gen_idx": gi,
"x_pred": ",".join(map(str, e.x_pred)),
"z_pred": ",".join(map(str, e.z_pred)),
"logical_correction":
e.rewards.get("logical_correction", 0.0),
"syndrome_consistency":
e.rewards.get("syndrome_consistency", 0.0),
"hamming_overlap":
e.rewards.get("hamming_overlap", 0.0),
"format_compliance":
e.rewards.get("format_compliance", 0.0),
"pymatching_beat":
e.rewards.get("pymatching_beat", 0.0),
"weighted_total": e.weighted_total,
"parse_success": e.parse_success,
"actual_obs_flip": e.actual_flip,
"pm_obs_flip": e.pm_flip,
"curriculum_level": e.curriculum_level,
})
if rows_out:
wandb_utils.log_generation_table(
rows_out, step=step, table_name="rl/generations",
columns=[
"step", "prompt", "gen_idx", "x_pred", "z_pred",
"logical_correction", "syndrome_consistency",
"hamming_overlap", "format_compliance",
"pymatching_beat", "weighted_total",
"parse_success", "actual_obs_flip", "pm_obs_flip",
"curriculum_level",
],
)
# ----- Wall-clock cap ----- #
elapsed = time.time() - started_at
if elapsed > wall_seconds and not state["wall_exceeded"]:
state["wall_exceeded"] = True
print(f"\n[grpo-walltime] wall-clock cap hit at step {step} "
f"({elapsed:.0f}s > {wall_seconds:.0f}s). "
f"Saving and exiting.")
try:
control.should_save = True
control.should_training_stop = True
except Exception:
pass
wandb_utils.log({
"alarms/wall_exceeded": 1.0,
"alarms/wall_seconds_at_cap": elapsed,
}, step=step)
# ----- Tier-2 + tier-3 eval ----- #
if inloop_every and step > 0 and step % inloop_every == 0:
self._run_inloop_eval(step)
def on_log(self, args, state_, control, logs=None, **kwargs): # noqa: D401
if not logs:
return
step = state_.global_step
# Tier-1: surface train/* metrics that TRL itself produces.
extra: dict = {}
for src_key, dst_key in [
("kl", "train/kl_divergence"),
("train/kl_divergence", "train/kl_divergence"),
("grad_norm", "train/grad_norm"),
("loss", "train/policy_loss"),
("learning_rate", "train/learning_rate"),
]:
if src_key in logs:
try:
extra[dst_key] = float(logs[src_key])
except (TypeError, ValueError):
pass
if extra:
wandb_utils.log(extra, step=step)
# KL alarm.
kl = logs.get("kl") or logs.get("train/kl_divergence")
if kl is not None:
try:
kl_v = float(kl)
except (TypeError, ValueError):
kl_v = None
if kl_v is not None and kl_v > kl_alarm:
wandb_utils.log({
"alarms/kl_alarm": 1.0,
"alarms/kl_alarm_value": kl_v,
}, step=step)
print(f"[grpo][step {step}] KL ALARM: {kl_v:.3f} "
f"> {kl_alarm:.3f} - inspect generations.")
# Decision rule: grad_norm > ceil for N consecutive logs.
gn = logs.get("grad_norm")
if gn is not None:
try:
gn_v = float(gn)
except (TypeError, ValueError):
gn_v = None
if gn_v is not None:
grad_norm_run.append(gn_v)
ceil = decision_thresholds["grad_norm_ceil"]
run_len = decision_thresholds["grad_norm_run_len"]
if (len(grad_norm_run) >= run_len
and all(x > ceil for x in grad_norm_run)
and step != state["grad_norm_warned_at"]):
print(f"\n[grpo-decision] CRITICAL @ step {step}: "
f"train/grad_norm > {ceil} for {run_len} "
f"consecutive logs ({list(grad_norm_run)}). "
f"Recommend reducing LR (warning only - no "
f"auto-action).")
wandb_utils.log({
"alarms/grad_norm_runaway": 1.0,
"alarms/grad_norm_value": gn_v,
}, step=step)
state["grad_norm_warned_at"] = step
def on_train_end(self, args, state_, control, **kwargs): # noqa: D401
self._run_inloop_eval(state_.global_step, table_name="rl/final_eval")
# ------------------------------------------------------------------ #
# Tier-2 / tier-3 eval (greedy, T=0) #
# ------------------------------------------------------------------ #
def _run_inloop_eval(self, step: int, table_name: str = "rl/inloop_eval"):
try:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)
except Exception:
model.eval() # type: ignore[attr-defined]
n = len(eval_rows)
stats = {
"logical_correction": 0,
"format_success": 0,
"format_partial": 0,
"pymatching_beat": 0,
"syndrome_consistency_pass": 0,
"exact_match_pymatching": 0,
"total_reward_sum": 0.0,
"completion_len_sum": 0,
"hard_lcr_num": 0,
"hard_lcr_den": 0,
"ler_d3_p001_logical_errors": 0,
"ler_d3_p001_total": 0,
"ler_d3_p001_rounds": 0,
}
preview_rows = []
pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
import torch
from qubit_medic.config import REWARD_WEIGHTS
for ep_idx, row in enumerate(eval_rows):
prompt = row["prompt"]
episode_id = int(row.get("episode_id", -1))
try:
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
chat, tokenize=False, add_generation_prompt=True,
)
except Exception:
text = ("<|im_start|>user\n" + prompt
+ "\n<|im_end|>\n<|im_start|>assistant\n")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
try:
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=inloop_max_new_tokens,
do_sample=False, # greedy at T=0 per spec
eos_token_id=tokenizer.eos_token_id,
pad_token_id=pad_id,
)
gen_ids = out[0][inputs["input_ids"].shape[1]:]
completion = tokenizer.decode(gen_ids, skip_special_tokens=True)
n_tokens = int(gen_ids.shape[0])
except Exception as exc: # pragma: no cover
completion = f"<gen-error: {exc}>"
n_tokens = 0
                # Score against the env. If the episode_id has expired
                # (TTL) we fall back to a fresh reset so the run continues,
                # but log nothing special - the metric arithmetic is still
                # correct.
try:
result = env_client.step(raw_response=completion,
episode_id=episode_id)
except Exception:
obs2 = env_client.reset(seed=row.get("seed"))
result = env_client.step(raw_response=completion,
episode_id=obs2.episode_id)
rwd = result.info.get("rewards", {}) or {}
action = result.info.get("parsed_action", {}) or {}
actual = int(result.info.get("actual_observable_flip", 0))
pm_pred = int(result.info.get("pymatching_observable_pred", 0))
we_correct = float(rwd.get("logical_correction", 0.0)) >= 0.5
pm_correct = (pm_pred == actual)
stats["logical_correction"] += int(we_correct)
stats["format_success"] += int(action.get("parse_success", False))
stats["format_partial"] += int(
float(rwd.get("format_compliance", 0.0)) >= 0.5
and not action.get("parse_success", False)
)
stats["pymatching_beat"] += int(
float(rwd.get("pymatching_beat", 0.0)) >= 0.5)
stats["syndrome_consistency_pass"] += int(
float(rwd.get("syndrome_consistency", 0.0)) >= 0.999)
weighted = sum(
weight * max(0.0, min(1.0, float(rwd.get(name, 0.0))))
for name, weight in REWARD_WEIGHTS.items()
)
stats["total_reward_sum"] += max(0.0, min(1.0, weighted))
stats["completion_len_sum"] += n_tokens
pm_x = sorted(set(map(int,
result.info.get("pymatching_x_errors", []) or [])))
pm_z = sorted(set(map(int,
result.info.get("pymatching_z_errors", []) or [])))
our_x = sorted(set(map(int,
action.get("x_error_qubits", []) or [])))
our_z = sorted(set(map(int,
action.get("z_error_qubits", []) or [])))
if (action.get("parse_success", False)
and pm_x == our_x and pm_z == our_z):
stats["exact_match_pymatching"] += 1
# Hard syndrome: >=2 stabilizers fired (anti-hacking spec
# forbids exposing true_x/true_z, so we use the syndrome
# bit count from the cached eval row as the proxy).
n_active = sum(1 for b in row.get("syndrome_bits", []) if int(b))
if n_active >= 2:
stats["hard_lcr_den"] += 1
stats["hard_lcr_num"] += int(we_correct)
# tier-3: per-round LER for d=3 / p=0.001 only.
d = int(row.get("distance", 0))
rnds = max(1, int(row.get("rounds", 0)))
if d == 3 and abs(float(row.get("p", 0.0)) - 0.001) < 1e-6:
stats["ler_d3_p001_total"] += 1
stats["ler_d3_p001_rounds"] = rnds
if not we_correct:
stats["ler_d3_p001_logical_errors"] += 1
if ep_idx < 4:
preview_rows.append({
"step": step,
"episode": ep_idx,
"completion": completion[:300],
"logical_correction": rwd.get("logical_correction", 0.0),
"syndrome_consistency": rwd.get("syndrome_consistency", 0.0),
"format_compliance": rwd.get("format_compliance", 0.0),
"pymatching_beat": rwd.get("pymatching_beat", 0.0),
"weighted_total": weighted,
})
denom = max(1, n)
lcr = stats["logical_correction"] / denom
beat_rate = stats["pymatching_beat"] / denom
fmt_compliance = stats["format_success"] / denom
hard_lcr = (stats["hard_lcr_num"] / max(1, stats["hard_lcr_den"])
if stats["hard_lcr_den"] else 0.0)
sync_consistency_rate = stats["syndrome_consistency_pass"] / denom
avg_completion_len = stats["completion_len_sum"] / denom
mean_total_reward = stats["total_reward_sum"] / denom
exact_match = stats["exact_match_pymatching"] / denom
# Tier-3 LER per round, log10.
ler_per_round = None
ler_log10 = None
if stats["ler_d3_p001_total"] > 0:
p_logical = (stats["ler_d3_p001_logical_errors"]
/ stats["ler_d3_p001_total"])
rounds = max(1, stats["ler_d3_p001_rounds"])
# Per-round LER: 1 - (1 - p_logical)^(1/rounds).
ler_per_round = 1.0 - (1.0 - max(0.0, min(1.0, p_logical))) ** (1.0 / rounds)
if ler_per_round > 0:
import math
ler_log10 = math.log10(max(ler_per_round, 1e-12))
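            # Example: 40 logical errors over 200 d=3/p=0.001 episodes with
            # rounds=5 gives p_logical=0.2, per-round LER
            # 1 - 0.8**(1/5) ~= 0.0436 and log10 ~= -1.36.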
# Tier-2 output diversity probe at T=1.0 (8 samples per prompt
# on a small subset to keep eval fast).
div_probe_n = min(8, len(eval_rows))
div_samples = 8
unique_counts: list[int] = []
for row in eval_rows[:div_probe_n]:
prompt = row["prompt"]
try:
chat = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
chat, tokenize=False, add_generation_prompt=True,
)
except Exception:
text = ("<|im_start|>user\n" + prompt
+ "\n<|im_end|>\n<|im_start|>assistant\n")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outs = []
for _ in range(div_samples):
try:
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=inloop_max_new_tokens,
do_sample=True,
temperature=1.0,
top_p=0.95,
top_k=50,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=pad_id,
)
gen = tokenizer.decode(
out[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
).strip()
except Exception:
gen = ""
outs.append(gen)
unique_counts.append(len(set(outs)))
output_diversity_t1 = (sum(unique_counts) / max(1, len(unique_counts))
if unique_counts else 0.0)
eval_metrics = {
"eval/logical_correction_rate": lcr,
"eval/pymatching_beat_rate": beat_rate,
"eval/format_compliance": fmt_compliance,
"eval/exact_match_pymatching": exact_match,
"eval/hard_syndrome_lcr": hard_lcr,
"eval/syndrome_consistency_rate": sync_consistency_rate,
"eval/avg_completion_length": avg_completion_len,
"eval/output_diversity_temp_1": output_diversity_t1,
"eval/total_reward_mean": mean_total_reward,
"eval/episodes": denom,
}
if ler_per_round is not None:
eval_metrics["eval/ler_per_round_d3_p001"] = ler_per_round
if ler_log10 is not None:
eval_metrics["eval/ler_per_round_log10"] = ler_log10
print(f"[grpo][eval@{step}] " + ", ".join(
f"{k.split('/')[-1]}={v:.4f}" if isinstance(v, float)
else f"{k.split('/')[-1]}={v}" for k, v in eval_metrics.items()
))
wandb_utils.log(eval_metrics, step=step)
if preview_rows:
wandb_utils.log_generation_table(
preview_rows, step=step, table_name=table_name,
)
# Decision rule: step-500 pymatching_beat sanity.
state["beat_rate_history"].append(beat_rate)
if len(state["beat_rate_history"]) > 5:
state["beat_rate_history"] = state["beat_rate_history"][-5:]
if (not state["step500_warned"]
and step >= decision_thresholds["beat_rate_check_step"]
and len(state["beat_rate_history"]) >= 5
and all(b == 0 for b in state["beat_rate_history"])):
print(f"\n[grpo-decision] WARN @ step {step}: "
f"eval/pymatching_beat_rate has been 0.0 across the last "
f"5 evals. The model is never finding syndromes where "
f"PyMatching fails - consider increasing the "
f"pymatching_beat reward weight (warning only).")
wandb_utils.log({"alarms/zero_beat_rate": 1.0}, step=step)
state["step500_warned"] = True
# Decision rule: format_compliance < floor.
if (fmt_compliance < decision_thresholds["format_floor"]
and step != state["format_warned_at"]):
print(f"\n[grpo-decision] WARN @ step {step}: "
f"eval/format_compliance={fmt_compliance:.3f} < "
f"{decision_thresholds['format_floor']}. Consider "
f"increasing format_compliance weight (warning only).")
wandb_utils.log({
"alarms/format_below_floor": 1.0,
"alarms/format_value": fmt_compliance,
}, step=step)
state["format_warned_at"] = step
# ----- Best-checkpoint tracking ----- #
if mean_total_reward > state["best_total_reward"]:
old = state["best_total_reward"]
state["best_total_reward"] = mean_total_reward
state["best_step"] = step
print(f"[grpo][eval@{step}] new best total_reward_mean="
f"{mean_total_reward:.4f} (prev {old:.4f}); "
f"saving to {best_dir}")
try:
if best_dir.exists():
shutil.rmtree(best_dir)
best_dir.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(best_dir))
tokenizer.save_pretrained(str(best_dir))
wandb_utils.update_summary({
"best/total_reward_mean": mean_total_reward,
"best/step": step,
})
except Exception as exc:
print(f"[grpo] WARN: failed to save best checkpoint: "
f"{exc!r}", file=sys.stderr)
# Switch back to training mode.
try:
from unsloth import FastLanguageModel
FastLanguageModel.for_training(model)
except Exception:
model.train() # type: ignore[attr-defined]
return _RolloutCallback()
# --------------------------------------------------------------------------- #
# Dataset of prompts #
# --------------------------------------------------------------------------- #
def _build_prompt_pool(env_client, n: int):
prompts = []
for _ in range(n):
obs = env_client.reset()
prompts.append({"prompt": obs.prompt, "episode_id": obs.episode_id})
return prompts
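# Each pool entry is {"prompt": <env prompt text>, "episode_id": <int>}. TRL's
# GRPOTrainer samples rollouts from the "prompt" column; the reward functions
# above ignore extra columns (**_unused), so episode_id mainly serves as a
# trace back to the env episode that produced the prompt.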
# --------------------------------------------------------------------------- #
# Main #
# --------------------------------------------------------------------------- #
def main(argv: Iterable[str] = ()) -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--sft-checkpoint", type=str, default=None,
help="LoRA adapter directory to start GRPO from. "
"Defaults to config.SFT_CHECKPOINT_PATH_FOR_GRPO "
"(checkpoints/sft_warmup/checkpoint-50).")
parser.add_argument("--output", type=str, default="checkpoints/grpo")
parser.add_argument("--model", type=str,
default=os.getenv(
"QUBIT_MEDIC_MODEL",
"unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit"),
help="Base model. Defaults to the 4-bit unsloth bundle "
"matching the SFT base.")
parser.add_argument("--steps", type=int, default=None)
parser.add_argument("--gen-per-prompt", type=int, default=None)
parser.add_argument("--lr", type=float, default=None)
parser.add_argument("--kl-coef", type=float, default=None)
parser.add_argument("--max-prompt-len", type=int, default=None)
parser.add_argument("--max-completion-len", type=int, default=None)
parser.add_argument("--seed", type=int, default=None)
parser.add_argument("--report-to", type=str, default="wandb")
parser.add_argument("--prompt-pool", type=int, default=512)
parser.add_argument("--wandb-run-name", type=str, default=None)
parser.add_argument("--wandb-group", type=str, default=None)
parser.add_argument("--wandb-tags", type=str, nargs="*", default=("grpo",))
parser.add_argument("--wandb-notes", type=str, default=None)
parser.add_argument("--sample-every", type=int, default=None)
parser.add_argument("--sample-n", type=int, default=None)
parser.add_argument("--inloop-eval-every", type=int, default=None)
parser.add_argument("--inloop-eval-episodes", type=int, default=None)
parser.add_argument("--kl-alarm", type=float, default=None)
parser.add_argument("--no-artifact", action="store_true")
parser.add_argument("--skip-preflight", action="store_true",
help="Skip the diversity preflight (DEBUG ONLY)")
args = parser.parse_args(list(argv))
# Lazy heavy imports.
try:
from unsloth import FastLanguageModel
except ImportError:
print("ERROR: unsloth not installed. "
"Run `pip install -r requirements-train.txt`", file=sys.stderr)
return 1
import torch
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer
# Belt-and-suspenders for the dynamo recompile-limit crash that killed a
# previous run at step 400. Env vars at the top of the file cover the
# case where torch hasn't been imported yet; this block covers the case
# where unsloth/torch were already imported (env vars no-op then) and
# also flips suppress_errors so any future overflow falls back to eager
# instead of raising FailOnRecompileLimitHit.
try:
import torch._dynamo as _dynamo
for _attr in ("cache_size_limit", "recompile_limit",
"accumulated_cache_size_limit"):
if hasattr(_dynamo.config, _attr):
setattr(_dynamo.config, _attr,
max(256, getattr(_dynamo.config, _attr)))
_dynamo.config.suppress_errors = True
except Exception as _exc: # pragma: no cover - defensive
print(f"[grpo-guard] WARNING: could not raise dynamo limits: "
f"{_exc!r}", file=sys.stderr)
# Pre-flight signature check + stale-cache wipe.
_wipe_stale_grpo_cache()
_assert_grpo_signature_compatible()
from qubit_medic import wandb_utils
from qubit_medic.client.client import make_default_client
from qubit_medic.config import (
GRPO_BATCH_SIZE, GRPO_CHECKPOINT_EVERY, GRPO_DECISION_BEAT_RATE_CHECK_STEP,
GRPO_DECISION_FORMAT_FLOOR, GRPO_DECISION_GRAD_NORM_CEIL,
GRPO_DECISION_GRAD_NORM_RUN_LEN, GRPO_DECISION_REWARD_STD_CHECK_STEP,
GRPO_DECISION_REWARD_STD_FLOOR, GRPO_DO_SAMPLE, GRPO_GEN_PER_PROMPT,
GRPO_GRAD_ACCUM, GRPO_INSPECTION_COLLAPSE_THRESHOLD,
GRPO_INSPECTION_HOOK_EVERY, GRPO_INSPECTION_SAMPLE_N, GRPO_KL_ALARM,
GRPO_KL_COEF, GRPO_LOG_EVERY, GRPO_LR, GRPO_LR_SCHEDULER,
GRPO_MAX_COMPLETION_LEN, GRPO_MAX_PROMPT_LEN, GRPO_OPTIMIZER,
GRPO_REPETITION_PENALTY, GRPO_SAMPLE_LOG_EVERY, GRPO_SAMPLE_LOG_N,
GRPO_SAVE_TOTAL_LIMIT, GRPO_STEPS, GRPO_TEMP_BUMP_ON_COLLAPSE,
GRPO_TEMPERATURE, GRPO_TOP_K, GRPO_TOP_P, GRPO_VAL_EPISODES,
GRPO_VAL_PATH, GRPO_VAL_SEED, GRPO_WALL_SECONDS, LORA_ALPHA, LORA_DROPOUT,
LORA_R, LORA_TARGET_MODULES, MODEL_ID, PRIMARY_SEED, REWARD_WEIGHTS,
SFT_CHECKPOINT_PATH_FOR_GRPO, WANDB_INLOOP_EVAL_EPISODES,
WANDB_INLOOP_EVAL_EVERY,
)
sft_ckpt = args.sft_checkpoint or SFT_CHECKPOINT_PATH_FOR_GRPO
steps = args.steps if args.steps is not None else GRPO_STEPS
gen_per_prompt = args.gen_per_prompt if args.gen_per_prompt is not None else GRPO_GEN_PER_PROMPT
lr = args.lr if args.lr is not None else GRPO_LR
kl_coef = args.kl_coef if args.kl_coef is not None else GRPO_KL_COEF
max_p = args.max_prompt_len if args.max_prompt_len is not None else GRPO_MAX_PROMPT_LEN
max_c = args.max_completion_len if args.max_completion_len is not None else GRPO_MAX_COMPLETION_LEN
seed = args.seed if args.seed is not None else PRIMARY_SEED
sample_every = args.sample_every if args.sample_every is not None else GRPO_SAMPLE_LOG_EVERY
sample_n = args.sample_n if args.sample_n is not None else GRPO_SAMPLE_LOG_N
inloop_every = args.inloop_eval_every if args.inloop_eval_every is not None else WANDB_INLOOP_EVAL_EVERY
inloop_episodes = args.inloop_eval_episodes if args.inloop_eval_episodes is not None else WANDB_INLOOP_EVAL_EPISODES
kl_alarm = args.kl_alarm if args.kl_alarm is not None else GRPO_KL_ALARM
_seed_everything(seed)
# ---- Env client --------------------------------------------------- #
env_client = make_default_client()
print(f"using env client: {type(env_client).__name__}; "
f"health = {env_client.health()}")
# ---- W&B init ----------------------------------------------------- #
report_to = wandb_utils.derive_report_to(args.report_to)
run_name = args.wandb_run_name or wandb_utils.make_run_name("grpo")
wandb_utils.init_run(
run_name=run_name,
job_type="grpo",
tags=args.wandb_tags,
notes=args.wandb_notes,
group=args.wandb_group,
extra_config={
"cli": {
"steps": steps,
"gen_per_prompt": gen_per_prompt,
"lr": lr,
"kl_coef": kl_coef,
"max_prompt_len": max_p,
"max_completion_len": max_c,
"prompt_pool": args.prompt_pool,
"sample_every": sample_every,
"sample_n": sample_n,
"inloop_eval_every": inloop_every,
"inloop_eval_episodes": inloop_episodes,
"kl_alarm": kl_alarm,
"temperature": GRPO_TEMPERATURE,
"top_p": GRPO_TOP_P,
"top_k": GRPO_TOP_K,
"repetition_penalty": GRPO_REPETITION_PENALTY,
"do_sample": GRPO_DO_SAMPLE,
"lr_scheduler": GRPO_LR_SCHEDULER,
"optimizer": GRPO_OPTIMIZER,
"grad_accum": GRPO_GRAD_ACCUM,
"effective_batch": GRPO_BATCH_SIZE * GRPO_GRAD_ACCUM,
"sft_checkpoint": sft_ckpt,
"model": args.model,
"seed": seed,
"report_to": report_to,
"wall_seconds": GRPO_WALL_SECONDS,
"reward_weights": dict(REWARD_WEIGHTS),
"val_seed": GRPO_VAL_SEED,
"val_episodes": GRPO_VAL_EPISODES,
},
},
)
# Use train/global_step as default x-axis for everything we log.
try:
run = wandb_utils.get_run()
if run is not None:
run.define_metric("train/global_step")
run.define_metric("train/*", step_metric="train/global_step")
run.define_metric("eval/*", step_metric="train/global_step")
run.define_metric("alarms/*", step_metric="train/global_step")
run.define_metric("rl/*", step_metric="train/global_step")
run.define_metric("best/*", step_metric="train/global_step")
except Exception as exc:
print(f"[wandb] could not define x-axis metric: {exc!r}", file=sys.stderr)
# ---- Build prompt pool -------------------------------------------- #
print(f"pre-generating {args.prompt_pool} prompts ...")
prompts = _build_prompt_pool(env_client, args.prompt_pool)
dataset = Dataset.from_list(prompts)
print(f" built dataset with {len(dataset)} prompts")
# ---- Frozen eval set --------------------------------------------- #
eval_rows = _load_or_build_eval_set(
env_client, seed=GRPO_VAL_SEED, n=inloop_episodes, path=GRPO_VAL_PATH,
)
# ---- Load model --------------------------------------------------- #
print(f"loading base={args.model}, sft adapter={sft_ckpt}")
base_for_load = sft_ckpt if Path(sft_ckpt).exists() else args.model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=base_for_load,
max_seq_length=max_p + max_c,
load_in_4bit=True,
dtype=None,
)
if not Path(sft_ckpt).exists():
print(f"[grpo] WARN: SFT checkpoint {sft_ckpt} not found; "
f"attaching fresh LoRA on the base model")
model = FastLanguageModel.get_peft_model(
model,
r=LORA_R,
lora_alpha=LORA_ALPHA,
target_modules=list(LORA_TARGET_MODULES),
lora_dropout=LORA_DROPOUT,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=seed,
)
# ---- Diversity preflight ----------------------------------------- #
if not args.skip_preflight:
ok = _diversity_preflight(
model, tokenizer,
val_path="data/sft_validation.jsonl",
n_prompts=5, n_samples_per_prompt=8,
temperature=GRPO_TEMPERATURE,
min_unique=3, min_passing=3,
max_new_tokens=max_c,
)
if not ok:
wandb_utils.update_summary({"preflight/passed": False})
wandb_utils.finish_run()
return 2
wandb_utils.update_summary({"preflight/passed": True})
else:
print("[grpo] --skip-preflight given; bypassing diversity preflight "
"(DEBUG ONLY)")
# ---- TRL GRPOConfig ---------------------------------------------- #
output_dir = Path(args.output)
output_dir.mkdir(parents=True, exist_ok=True)
best_dir = output_dir / "best"
final_dir = output_dir / "final"
bf16_supported = (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()
)
grpo_kwargs: dict = {
"output_dir": str(output_dir),
"max_steps": steps,
"per_device_train_batch_size": GRPO_BATCH_SIZE,
"gradient_accumulation_steps": GRPO_GRAD_ACCUM,
"num_generations": gen_per_prompt,
"max_prompt_length": max_p,
"max_completion_length": max_c,
"learning_rate": lr,
"beta": kl_coef,
"lr_scheduler_type": GRPO_LR_SCHEDULER,
"optim": GRPO_OPTIMIZER,
"logging_steps": GRPO_LOG_EVERY,
"save_steps": GRPO_CHECKPOINT_EVERY,
"save_total_limit": GRPO_SAVE_TOTAL_LIMIT,
"save_only_model": False,
"seed": seed,
"bf16": bf16_supported,
"fp16": torch.cuda.is_available() and not bf16_supported,
"report_to": report_to,
"run_name": run_name,
# Diversity-focused rollout sampling.
"temperature": GRPO_TEMPERATURE,
"top_p": GRPO_TOP_P,
"top_k": GRPO_TOP_K,
"repetition_penalty": GRPO_REPETITION_PENALTY,
}
# Some TRL versions don't accept every sampling kwarg on GRPOConfig;
# fall back gracefully so the script still runs.
config = None
dropped: list[str] = []
while config is None:
try:
config = GRPOConfig(**grpo_kwargs)
except TypeError as exc:
msg = str(exc)
removed = False
for k in ("repetition_penalty", "top_k", "top_p", "temperature",
"save_only_model"):
if k in msg and k in grpo_kwargs:
grpo_kwargs.pop(k)
dropped.append(k)
removed = True
break
if not removed:
raise
if dropped:
print(f"[grpo] WARN: TRL did not accept these GRPOConfig kwargs and "
f"they were dropped: {dropped}. Using TRL defaults for them.")
# ---- Reward functions + scoring cache ----------------------------- #
cache = _BatchScoringCache(env_client=env_client,
reward_weights=dict(REWARD_WEIGHTS))
reward_fns = _make_reward_fns(cache)
# The first reward is the bounded weighted-total used for the gradient;
# the rest are zero-weight observers used only for per-component traces.
reward_weights = [1.0] + [0.0] * len(_REWARD_COMPONENTS)
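    # With the 5 observer components this expands to [1.0, 0.0, 0.0, 0.0, 0.0,
    # 0.0]: only the bounded weighted-total shapes the GRPO advantage; the
    # observers exist purely for per-component W&B traces.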
callbacks = []
cb = _build_wandb_callback(
cache, model, tokenizer, env_client, eval_rows,
sample_every=sample_every, sample_n=sample_n,
inloop_every=inloop_every,
inloop_max_new_tokens=max_c,
kl_alarm=kl_alarm,
inspection_every=GRPO_INSPECTION_HOOK_EVERY,
inspection_sample_n=GRPO_INSPECTION_SAMPLE_N,
inspection_collapse_threshold=GRPO_INSPECTION_COLLAPSE_THRESHOLD,
temp_bump_on_collapse=GRPO_TEMP_BUMP_ON_COLLAPSE,
best_dir=best_dir, output_dir=output_dir,
wall_seconds=GRPO_WALL_SECONDS,
decision_thresholds={
"reward_std_floor": GRPO_DECISION_REWARD_STD_FLOOR,
"reward_std_check_step": GRPO_DECISION_REWARD_STD_CHECK_STEP,
"beat_rate_check_step": GRPO_DECISION_BEAT_RATE_CHECK_STEP,
"format_floor": GRPO_DECISION_FORMAT_FLOOR,
"grad_norm_ceil": GRPO_DECISION_GRAD_NORM_CEIL,
"grad_norm_run_len": GRPO_DECISION_GRAD_NORM_RUN_LEN,
},
)
if cb is not None:
callbacks.append(cb)
# Older TRL versions: GRPOTrainer may not accept reward_weights kw.
trainer_kwargs = dict(
model=model,
processing_class=tokenizer,
args=config,
train_dataset=dataset,
reward_funcs=reward_fns,
reward_weights=reward_weights,
callbacks=callbacks,
)
try:
trainer = GRPOTrainer(**trainer_kwargs)
except TypeError as exc:
if "reward_weights" in str(exc):
print("[grpo] WARN: this TRL does not accept reward_weights= "
"on GRPOTrainer; falling back to using only the bounded "
"weighted-total reward (and no observers).")
trainer_kwargs.pop("reward_weights")
trainer_kwargs["reward_funcs"] = [reward_fns[0]]
trainer = GRPOTrainer(**trainer_kwargs)
else:
raise
print(f"running GRPO for {steps} steps "
f"(temperature={GRPO_TEMPERATURE}, top_p={GRPO_TOP_P}, "
f"top_k={GRPO_TOP_K}, repetition_penalty={GRPO_REPETITION_PENALTY}, "
f"beta={kl_coef}, lr={lr}) ...")
started = time.time()
train_result = trainer.train()
elapsed = time.time() - started
print(f"finished in {elapsed:.1f}s")
metrics = getattr(train_result, "metrics", {}) or {}
wandb_utils.update_summary({
"grpo/wall_seconds": elapsed,
"grpo/total_episodes": cache._episodes,
"grpo/total_timeouts": cache._timeouts,
"grpo/reward_bounds_violations": cache._bounds_violations,
**{f"grpo/final/{k}": v for k, v in metrics.items()
if isinstance(v, (int, float))},
})
# ---- Final + rolling adapter saves ------------------------------- #
print(f"saving rolling adapter snapshot to {output_dir}")
model.save_pretrained(str(output_dir))
tokenizer.save_pretrained(str(output_dir))
print(f"saving final adapter snapshot to {final_dir}")
final_dir.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(final_dir))
tokenizer.save_pretrained(str(final_dir))
if not args.no_artifact:
wandb_utils.log_artifact(
str(final_dir),
name=f"grpo-final-{run_name}",
artifact_type="model",
description="GRPO final LoRA adapter (Qubit-Medic).",
)
if best_dir.exists():
wandb_utils.log_artifact(
str(best_dir),
name=f"grpo-best-{run_name}",
artifact_type="model",
description="GRPO best-eval LoRA adapter (Qubit-Medic).",
)
wandb_utils.finish_run()
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))