Baladithya Balamurugan

	# Overview — Composer 2.5 Replication Framework (5-minute read)

	*Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
	the front-door pitch see [`README.md`](../README.md); for the honest gap list see
	[`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
	[`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*

	## What it is

	An open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)
	recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
	coder — generalized so it runs on any HuggingFace causal LM with a chat template (Qwen,
	Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
	(`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
	recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
	scope for v0.
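The scope rule above can be approximated as a pre-flight check. This is a minimal sketch, not part of the package: `is_in_scope` is a hypothetical helper, and treating a `vision_config` attribute as the marker of a VLM is a heuristic assumption. `is_encoder_decoder` and `chat_template` are the usual attribute names on HuggingFace config and tokenizer objects.

```python
from types import SimpleNamespace


def is_in_scope(config, tokenizer) -> bool:
    """Hypothetical check: decoder-only LM whose tokenizer ships a chat template."""
    # Encoder-decoder models are out of scope for v0.
    if getattr(config, "is_encoder_decoder", False):
        return False
    # Heuristic (assumption): many VLM configs carry a nested `vision_config`.
    if getattr(config, "vision_config", None) is not None:
        return False
    # Base models without a chat template are out of scope for v0.
    return bool(getattr(tokenizer, "chat_template", None))


# Stand-in objects so the sketch runs without downloading a model.
chat_lm = (SimpleNamespace(is_encoder_decoder=False),
           SimpleNamespace(chat_template="{{ messages }}"))
seq2seq = (SimpleNamespace(is_encoder_decoder=True),
           SimpleNamespace(chat_template="{{ messages }}"))
base_lm = (SimpleNamespace(is_encoder_decoder=False),
           SimpleNamespace(chat_template=None))

print(is_in_scope(*chat_lm), is_in_scope(*seq2seq), is_in_scope(*base_lm))
```

With `transformers` installed, the same function would presumably accept the objects returned by `AutoConfig.from_pretrained(name)` and `AutoTokenizer.from_pretrained(name)`.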

	This repo is the methodology repo ("the paper of the project"). Trained-variant model
	repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).

	## The three channels — with honest provenance

	The framework composes a single training loss out of three additive channels. **Two replicate
	Cursor's published recipe; the third is the framework's own research addition.** Getting this
	provenance right is the whole point — see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).

	| # | Channel | What it is | Provenance |
	|---|---|---|---|
	| 1 | Base policy optimization | RL on verifiable rewards (RLVR). Default Dr.GRPO, now a selectable menu (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ Genuine replication. Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
	| 2 | SDPO self-distillation | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a self-teacher → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ Genuine replication. This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
	| 3 | Trace-replay-DPO | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ The framework's OWN additive research channel, NOT part of Cursor's recipe. Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks on top of the genuine replication; it does not define it. |

	> Read this before citing the framework. Any statement of the form "Composer does
	> trace-replay-DPO" or "the replication target includes channel 3" is wrong. Cursor's
	> recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
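To make channel 3 concrete, here is a minimal sketch of how teacher (dis)agreement could become a preference pair and how the standard DPO loss would score it. The pairing rule (strict teacher majority that differs from the trace's own action) and the names `build_pair`, `dpo_loss`, and `dpo_beta` are illustrative assumptions, not the framework's actual implementation. `dpo_beta` is the DPO temperature and is distinct from the channel weight β.

```python
import math
from collections import Counter


def build_pair(trace_action, teacher_actions):
    """Return (chosen, rejected), or None when the teachers give no usable signal.

    Assumed rule: a strict majority of teachers must agree on one action,
    and that action must differ from what the frozen trace actually did.
    """
    if not teacher_actions:
        return None
    top, votes = Counter(teacher_actions).most_common(1)[0]
    if votes * 2 <= len(teacher_actions):  # no strict majority
        return None
    if top == trace_action:  # teachers agree with the trace: nothing to prefer
        return None
    return top, trace_action


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, dpo_beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = dpo_beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


pair = build_pair("ls", ["cat a.py", "cat a.py", "ls"])
print(pair)
print(dpo_loss(pi_chosen=-1.0, pi_rejected=-3.0, ref_chosen=-2.0, ref_rejected=-2.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2, which is a convenient sanity check for any real implementation.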

	The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`;
	production uses `ComposerReplicationTrainer._compute_loss` (a real `trl.GRPOTrainer` subclass),
	where channel 1 is real GRPO rather than the LM-CE stub. See
	[`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
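The verification-harness formula can be illustrated with a toy calculation. This is a sketch under stated assumptions: `jsd` and `compose_total` are hypothetical stand-ins written for this page, not the package's `compose_loss` signature, and the distributions are made-up numbers. It shows only the additive structure and how α and β gate channels 2 and 3.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def compose_total(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


# Toy next-token distributions at an error turn (illustrative numbers only).
student = [0.7, 0.2, 0.1]
hint_conditioned_teacher = [0.3, 0.6, 0.1]
sdpo_term = jsd(student, hint_conditioned_teacher)

lm_ce, dpo_term = 1.25, 0.69
print(compose_total(lm_ce, sdpo_term, dpo_term, alpha=0.0, beta=0.0))  # channel 1 only
print(compose_total(lm_ce, sdpo_term, dpo_term, alpha=0.5, beta=0.0))  # SDPO on
print(compose_total(lm_ce, sdpo_term, dpo_term, alpha=0.5, beta=0.1))  # all three
```

Setting `alpha=0` or `beta=0` switches a channel off without touching the others, which is what makes on/off comparisons between channels straightforward.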

	## What's proven

	- CPU SDPO-fires. On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
	(`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ — the "is the loss decrease just
	memorization?" critique is closed (Spike 006-strict).
	- Real GPU run. Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
	0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
	- A1 8B-ladder Modal run. The GRPO-only arm (A1) of the LMA channel ladder has a real
	Modal runner and has been run with `dr_grpo`.
	- GSM8K GRPO. The `examples/gsm8k_grpo*` end-to-end examples exercise the production
	trainer on a real reasoning benchmark.
	- Economic feasibility of channel 3. 150 real OpenRouter calls, $0.98/trace mean, 0
	errors (Spike 001).
	- Installable + tested. `pip install -e .` works; 266 passing / 62 skipped (measured 2026-06-09;
	canonical count + why skips vary by env: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).

	## What's gapped (honest, NOT closed)

	1. Docker / TorchForge substrate E2E is hardware-blocked — the test exists and skips
	cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
	2. The full 8B LMA channel ladder (A2–A4) is not yet runnable. Only A1 (GRPO-only)
	has a real Modal runner. A2 (SDPO), A3 (replay-DPO), and A4 (combined) are scaffold +
	plan-builder only. Running them on a real 8B checkpoint also needs a real error-trace
	SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint, none of which
	exist yet. The real 8B run is also gated on user budget.
	3. The empirical question — does the method actually beat plain GRPO at scale? — is the
	GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.

	See [`BACKLOG.md`](../BACKLOG.md) for the live gap list, the **Foot-guns worth knowing
	on day one** section just below for the day-one gotchas (branch sync, `strip_thinking`,
	k1/k3, `compose_loss`-is-harness), and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
	for install/runtime failure modes.

	## Foot-guns worth knowing on day one

	- Branch sync (resolved 2026-06-09). `main` is canonical and kept in sync with `master`,
	so a fresh Hub clone of `main` installs the complete tree. If you ever hit an `ImportError`
	on `make_dr_grpo_config`, your clone is stale (`git fetch && git checkout main`). Historically
	`main` lagged `master`; the fix holds only as long as the two branches stay synced.
	- `strip_thinking` × SDPO. On real agent traces, SDPO requires `strip_thinking=False`:
	~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
	- KL estimator delta. TRL uses the k3 estimator; Composer's report describes k1.
	This is a documented, intentional delta — the framework does not silently claim k1 parity.
	- `compose_loss` is the verification harness, not production. Its channel-1 is an LM-CE
	stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
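The k1/k3 delta in the list above can be made concrete with a small calculation. This sketch uses the common k1/k3 naming for per-sample KL estimators. The function names and the toy distributions are assumptions for illustration; the k3 expression follows the `exp(x) - x - 1` form that TRL's GRPO per-token KL is understood to use.

```python
import math


def k1(logp, ref_logp):
    """k1: log(pi/ref) on a token sampled from pi. Unbiased, but can be negative."""
    return logp - ref_logp


def k3(logp, ref_logp):
    """k3: (r - 1) - log r with r = ref/pi. Unbiased and never negative."""
    log_r = ref_logp - logp
    return math.exp(log_r) - log_r - 1.0


# Toy policy and reference distributions over three tokens (illustrative only).
pi = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]

exact_kl = sum(p * math.log(p / r) for p, r in zip(pi, ref))
# Exact expectations under pi, computed by enumeration instead of sampling.
mean_k1 = sum(p * k1(math.log(p), math.log(r)) for p, r in zip(pi, ref))
mean_k3 = sum(p * k3(math.log(p), math.log(r)) for p, r in zip(pi, ref))

per_token_k1 = [k1(math.log(p), math.log(r)) for p, r in zip(pi, ref)]
per_token_k3 = [k3(math.log(p), math.log(r)) for p, r in zip(pi, ref)]
print(exact_kl, mean_k1, mean_k3)
print(per_token_k1)
print(per_token_k3)
```

Both estimators agree with KL(π‖ref) in expectation, so the delta does not change the quantity being penalized. They differ per token: k1 goes negative wherever the policy assigns less mass than the reference, while k3 stays non-negative, which changes the variance and per-token gradient signal.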

	## Where to go next

	| You want to… | Read |
	|---|---|
	| Pitch / status / roadmap | [`README.md`](../README.md) |
	| Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
	| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
	| Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
	| Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
	| How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
	| Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
	| Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |