Baladithya Balamurugan
Wave 2: 4 new modules (kill-switch, EKS/SageMaker executors, DockerSandbox) + B4/B7 completion
7a55e1e

# Overview: Composer 2.5 Replication Framework (5-minute read)

*Current through [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md) (2026-06). For
the front-door pitch see [`README.md`](../README.md); for the honest gap list see
[`BACKLOG.md`](../BACKLOG.md); for the clause-by-clause vision audit see
[`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md).*
## What it is

An **open, methodology-first replication of Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5)**
recipe (the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
coder), generalized so it runs on **any HuggingFace causal LM with a chat template** (Qwen,
Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
(`pip install -e .` → `composer_replication`) plus a research corpus (ADRs, deep-dives,
recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
scope for v0.

This repo is the **methodology repo** ("the paper of the project"). Trained-variant model
repos and trace datasets are split out per [`docs/HF_REPO_LAYOUT.md`](HF_REPO_LAYOUT.md).
## The three channels, with honest provenance

The framework composes a single training loss out of three additive channels. **Two replicate
Cursor's published recipe; the third is the framework's own research addition.** Getting this
provenance right is the whole point; see [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md).

| # | Channel | What it is | Provenance |
|---|---|---|---|
| **1** | **Base policy optimization** | RL on verifiable rewards (RLVR). Default **Dr.GRPO**, now a **selectable menu** (`make_po_config(objective=…)` over `{grpo, dr_grpo, bnpo, dapo, gspo, cispo}`, per ADR-014). | ✅ **Genuine replication.** Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
| **2** | **SDPO self-distillation** | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a *self-teacher* → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ **Genuine replication.** This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
| **3** | **Trace-replay-DPO** | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder ([ADR-013](adrs/ADR-013-lma-integration-channel-ladder.md)). | ⚠️ **The framework's OWN additive research channel, NOT part of Cursor's recipe.** Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks *on top of* the genuine replication; it does not define it. |
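
To make the default channel-1 choice concrete, here is a minimal sketch of how group-relative advantages differ between GRPO and Dr.GRPO as those objectives are commonly described: GRPO divides the mean-centered reward by the group standard deviation, while Dr.GRPO drops that division (and also drops per-sequence length normalization, which is not shown here). The function names, the `eps` value, the use of a population standard deviation, and the reward values are assumptions for illustration; this is not the framework's `make_po_config` code.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-4):
    """GRPO-style: center on the group mean, then divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO-style: center on the group mean only, no std division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Two hypothetical groups of verifiable rewards (one group per prompt).
wide_group = [1.0, 0.0]      # one rollout passes, one fails
narrow_group = [0.51, 0.49]  # rollouts barely differ

for name, group in [("wide", wide_group), ("narrow", narrow_group)]:
    print(name, grpo_advantages(group), dr_grpo_advantages(group))
```

With std division, both groups end up with advantages of nearly the same magnitude, so a prompt whose rollouts barely differ is weighted about as heavily as one with a clear pass/fail split. Without it, the advantage stays proportional to the actual reward gap.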
> **Read this before citing the framework.** Any statement of the form "Composer does
> trace-replay-DPO" or "the replication target includes channel 3" is **wrong**. Cursor's
> recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.

The full loss (verification-harness form) is `total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo`.
Production uses `ComposerReplicationTrainer._compute_loss` (the trainer is a real
`trl.GRPOTrainer` subclass), where channel 1 is real GRPO rather than the LM-CE stub. See
[`docs/USER_GUIDE.md`](USER_GUIDE.md) and [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md).
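
The harness-form sum above can be sketched in plain Python. This is a toy illustration, not the framework's `compose_loss`: every function name below is hypothetical, the SDPO term is assumed to be a Jensen-Shannon divergence averaged over masked token positions (inferred only from the `sdpo_jsd` name), and the channel-3 term uses the standard DPO log-sigmoid loss with its own temperature (`dpo_temperature`, deliberately kept separate from the channel weight β).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    mid = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def sdpo_jsd_term(student_logits, teacher_logits, mask):
    """Mean JSD over the positions kept by mask (e.g. the error turn)."""
    picked = [jsd(softmax(s), softmax(t))
              for s, t, keep in zip(student_logits, teacher_logits, mask) if keep]
    return sum(picked) / len(picked) if picked else 0.0  # empty mask: no signal


def dpo_term(pol_chosen, pol_rejected, ref_chosen, ref_rejected, dpo_temperature=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), on sequence log-probs."""
    margin = dpo_temperature * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def total_loss_sketch(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


# Hypothetical per-position logits; only position 0 is inside the error turn.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[0.5, 2.0, -1.0], [0.1, 0.2, 0.3]]
mask = [True, False]

sdpo = sdpo_jsd_term(student, teacher, mask)
dpo = dpo_term(-4.0, -6.0, -5.0, -5.0)
print(total_loss_sketch(lm_ce=1.2, sdpo_jsd=sdpo, trace_replay_dpo=dpo, alpha=0.5, beta=0.0))
```

Setting `beta=0.0` switches channel 3 off entirely, which is one way to read "β-gated": the sum then contains only the channel-1 stub and the SDPO term.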
## What's proven

- **CPU SDPO-fires.** On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires
  (`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ; the "is the loss decrease just
  memorization?" critique is closed (Spike 006-strict).
- **Real GPU run.** Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss
  0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
- **A1 8B-ladder Modal run.** The GRPO-only arm (A1) of the LMA channel ladder has a real
  Modal runner and has been run with `dr_grpo`.
- **GSM8K GRPO.** The `examples/gsm8k_grpo*` end-to-end examples exercise the production
  trainer on a real reasoning benchmark.
- **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
  errors (Spike 001).
- **Installable + tested.** `pip install -e .` works; **266 passing / 62 skipped** (measured 2026-06-09;
  canonical count + why skips vary by env: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
## What's gapped (honest, NOT closed)

1. **Docker / TorchForge substrate E2E** is **hardware-blocked**: the test exists and skips
   cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
2. **The full 8B LMA channel ladder (A2–A4) is not yet runnable.** Only **A1 (GRPO-only)**
   has a real Modal runner. **A2 (SDPO) / A3 (replay-DPO) / A4 (combined)** are scaffold +
   plan-builder only; running them on a real 8B checkpoint additionally needs a real
   error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that
   don't exist yet. The real 8B run is *additionally* user-budget-gated.
3. **The empirical question** (does the method actually beat plain GRPO at scale?) is the
   GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.

See [`BACKLOG.md`](../BACKLOG.md) for the live gap list, the **Foot-guns worth knowing
on day one** section just below for the day-one gotchas (branch sync, `strip_thinking`,
k1/k3, `compose_loss`-is-harness), and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
for install/runtime failure modes.
## Foot-guns worth knowing on day one

- **Branch sync (resolved 2026-06-09).** `main` is canonical and kept in sync with `master`,
  so a fresh Hub clone of `main` installs the complete tree. If you ever hit an `ImportError` on
  `make_dr_grpo_config`, your clone is stale (`git fetch && git checkout main`). Historically
  `main` lagged `master`; that's fixed as long as both stay synced.
- **`strip_thinking` × SDPO.** On real agent traces, SDPO requires `strip_thinking=False`:
  ~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
- **KL estimator delta.** TRL uses the **k3** estimator; Composer's report describes **k1**.
  This is a documented, intentional delta: the framework does not silently claim k1 parity.
- **`compose_loss` is the verification harness, not production.** Its channel-1 is an LM-CE
  stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
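
The k1/k3 delta in the list above is easier to see with numbers. The sketch below uses the usual definitions of the two per-sample estimators of KL(policy ‖ reference), with samples drawn from the policy and r = π_ref(x) / π_policy(x): k1 = −log r and k3 = (r − 1) − log r. The two toy distributions are made up for illustration, and the expectations are computed exactly rather than sampled so the comparison is deterministic. Nothing here is taken from the framework's or Cursor's code.

```python
import math


def k1(logp_policy, logp_ref):
    """k1 = -log r, where r = pi_ref(x) / pi_policy(x) and x ~ policy."""
    return logp_policy - logp_ref


def k3(logp_policy, logp_ref):
    """k3 = (r - 1) - log r; non-negative for every individual sample."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - 1.0 - log_r


policy = [0.7, 0.2, 0.1]  # hypothetical next-token distribution of the policy
ref = [0.5, 0.3, 0.2]     # hypothetical reference distribution

exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

# Per-token estimator values, then their exact expectation under the policy.
per_token = [(k1(math.log(p), math.log(q)), k3(math.log(p), math.log(q)))
             for p, q in zip(policy, ref)]
mean_k1 = sum(p * a for p, (a, _) in zip(policy, per_token))
mean_k3 = sum(p * b for p, (_, b) in zip(policy, per_token))

print("exact KL:", exact_kl, "E[k1]:", mean_k1, "E[k3]:", mean_k3)
print("per-token (k1, k3):", per_token)
```

Both estimators are unbiased for the same KL, but they disagree sample by sample: k1 goes negative on tokens the reference likes more than the policy does, while k3 never does. That per-sample difference is why swapping one for the other is a real delta rather than a cosmetic one.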
## Where to go next

| You want to… | Read |
|---|---|
| Pitch / status / roadmap | [`README.md`](../README.md) |
| Run it end-to-end | [`docs/USER_GUIDE.md`](USER_GUIDE.md) |
| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | [`docs/INTEGRATION_RECIPES.md`](INTEGRATION_RECIPES.md) |
| Exact kwargs / signatures | [`docs/API_REFERENCE.md`](API_REFERENCE.md) |
| Why each design decision | [`docs/adrs/README.md`](adrs/README.md) |
| How Cursor's recipe maps to our components | [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) |
| Honest gaps / open work | [`BACKLOG.md`](../BACKLOG.md), [`docs/VISION_VALIDATION.md`](VISION_VALIDATION.md) |
| Fix a broken install / run | [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) |