Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| # Composer Replication Framework — User Guide | |
| A zero-to-training walkthrough for the open replication of Cursor Composer 2.5. | |
| Pace: an ML engineer who knows GRPO/DPO at a textbook level but has never | |
| opened this repo. Every step references real code, and every kwarg name | |
| listed below has been imported and verified against | |
| `composer_replication/` source. | |
| --- | |
| ## 1. What is this framework? | |
| A pure-PyTorch replication of the **3-channel composer loss** that powers | |
| agentic-coding model training. One model, one optimizer, three additive | |
| loss terms — composed every step: | |
| ``` | |
| ┌────────────────────────────────────────────┐ | |
| │ compose_loss(model, batch) │ | |
| └────────────────────────────────────────────┘ | |
| │ | |
| ┌─────────────────────────────────┼─────────────────────────────────┐ | |
| ▼ ▼ ▼ | |
| ┌───────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ | |
| │ Channel 1 (RL) │ │ Channel 2 (SDPO) │ │ Channel 3 (replay) │ | |
| │ GRPO │ │ hint-distillation │ │ multi-teacher DPO │ | |
| │ → lm_ce stub in │ │ generalized JSD │ │ on (chosen, │ | |
| │ verification │ │ student vs teacher │ │ rejected) pairs │ | |
| │ harness │ │ (hint-conditioned) │ │ from N teachers │ | |
| └─────────┬─────────┘ └──────────┬───────────┘ └──────────┬───────────┘ | |
| │ weight = 1 (always on) │ alpha_sdpo beta_replay │ | |
| └────────────────┬──────────────┴─────────────────┬──────────────────┘ | |
| ▼ ▼ | |
| total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo | |
| (channel auto-disables if its weight=0 OR its inputs are missing) | |
| ``` | |
| Two API surfaces, on purpose: | |
| - **Verification harness** — `compose_loss(model, batch, ...)` is a free | |
| function (channel 1 = LM cross-entropy, the GRPO limit under deterministic | |
| rewards). Use it for CPU smokes, unit tests, and gradient-flow debugging. | |
| - **Production trainer** — `ComposerReplicationTrainer` is a `trl.GRPOTrainer` | |
| subclass that overrides `_compute_loss` with the same 3 channels on top of | |
| TRL's real reward + advantage machinery. | |
| The verification harness is what you'll use for sections 2–6; the production | |
| trainer (and its alternates VeRL/PRIME-RL/Monarch) is section 8. | |
| Source of truth: `composer_replication/loss.py` for `compose_loss`, | |
| `composer_replication/trainer/composer_trainer.py` for the trainer subclass. | |
| --- | |
| ## 2. Install — which extras to pick | |
| Always start with the core install: | |
| ```bash | |
| git clone https://huggingface.co/Codeseys/composer-replication-framework | |
| cd composer-replication-framework | |
| pip install -e . | |
| ``` | |
| That gets you `torch>=2.0` + `transformers>=4.46` and is enough for the | |
| verification harness on CPU (sections 3, 5, 6). | |
| The seven optional extras are declared in `pyproject.toml` `[project.optional-dependencies]`: | |
| ``` | |
| Do you need … | |
| │ | |
| ┌──────────────────────────┼──────────────────────────┐ | |
| ▼ ▼ ▼ | |
| real teacher calls DiLoCo on production | |
| over OpenRouter? >1 GPU? GRPO training? | |
| │ │ │ | |
| │ yes │ yes │ yes | |
| ▼ ▼ ▼ | |
| pip install -e ".[replay]" pip install -e ".[diloco]" pip install -e ".[train]" | |
| (httpx) (torchft-nightly) (trl, peft, accelerate, datasets) | |
| │ │ │ | |
| │ + want CPU-side │ + scaling beyond a │ + want PRIME-RL | |
| │ DPO normalization? │ single host? │ (Recipe C)? | |
| ▼ ▼ ▼ | |
| pip install -e \".[replaysim]\" pip install -e \".[serverless]\" pip install -e \".[prime-rl]\" | |
| (data-juicer; depends (fsspec, huggingface_hub) (prime-rl>=0.5) | |
| on [replay]) | |
| │ + Monarch actor mesh? | |
| ▼ | |
| pip install -e \".[monarch]\" | |
| (monarch>=0.4.1) | |
| ``` | |
| Quick decision table: | |
| | Goal | Install | | |
| |-------------------------------------------------------|------------------------------------------| | |
| | CPU smoke / verification (sections 3, 5, 6) | `pip install -e .` | | |
| | Section 4 (replaysim DJNormalizer) | `pip install -e ".[replaysim]"` | | |
| | Section 7 dev loop (LocalProcessExecutor + file://) | `pip install -e ".[serverless]"` | | |
| | Real DiLoCo outer-loop | `pip install -e ".[diloco,serverless]"` | | |
| | Section 8 Recipe A (TRL GRPO) | `pip install -e ".[train]"` | | |
| | Section 8 Recipe C (PRIME-RL) | `pip install -e ".[prime-rl]"` | | |
| | Section 8 Recipe C+D (PRIME-RL + Monarch) | `pip install -e ".[prime-rl,monarch]"` | | |
| | Everything for development | `pip install -e ".[dev]"` | | |
| --- | |
| ## 3. Quickstart: `examples/qwen_05b_quickstart` end-to-end on CPU | |
| The fastest way to convince yourself the framework works on a real HF model. | |
| ~3–5 min wall-clock on CPU, ~1 GB disk for Qwen2.5-0.5B weights. | |
| ```bash | |
| pip install -e . | |
| python examples/qwen_05b_quickstart/run.py | |
| ``` | |
| What the script does (read the source at | |
| `examples/qwen_05b_quickstart/run.py`): | |
| 1. Pin RNG (`random.seed(42)`, `torch.manual_seed(42)`) so the per-step | |
| numbers below are reproducible. | |
| 2. Load `Qwen/Qwen2.5-0.5B-Instruct` on CPU in fp32, set `model.train()`. | |
| 3. `batch = build_batch(tokenizer, device="cpu")` — a real chat-template-formatted | |
| batch with all keys the 3-channel composer might consume. | |
| 4. Five backward steps with `compose_loss(model, batch, alpha_sdpo=0.1, | |
| beta_replay=0.05)`; an `AdamW(lr=1e-5)` optimizer; finite-grad check | |
| after each step. | |
| Expected output (transcribed from `examples/qwen_05b_quickstart/run.log`): | |
| ``` | |
| [quickstart] loading Qwen/Qwen2.5-0.5B-Instruct (CPU, fp32) ... | |
| [quickstart] loaded — 0.494B params | |
| [quickstart] building real chat-template batch ... | |
| [quickstart] running 5 backward steps ... | |
| step 0: total=0.7390 lm_ce=0.7358 sdpo=0.0000 dpo=0.0639 finite=True | |
| step 1: total=0.0379 lm_ce=0.0351 sdpo=0.0000 dpo=0.0563 finite=True | |
| step 2: total=0.0122 lm_ce=0.0110 sdpo=0.0000 dpo=0.0240 finite=True | |
| step 3: total=0.0060 lm_ce=0.0055 sdpo=0.0000 dpo=0.0098 finite=True | |
| step 4: total=0.0031 lm_ce=0.0029 sdpo=0.0000 dpo=0.0044 finite=True | |
| ======================================================== | |
| Initial loss: 0.7390 → Final loss: 0.0031 → Reduction: 99.6% | |
| Verdict: PASS | |
| ======================================================== | |
| ``` | |
| How to read this: | |
| - **`total` collapses by ~99%.** The model successfully memorizes the | |
| single batch — exactly what you expect from an SGD pass on a 0.5B model | |
| with one fixed input. This is a wiring check, not a generalization claim. | |
| - **`lm_ce` carries almost all the magnitude.** Channel 1 (the GRPO stub) | |
| is doing the work — the response tokens are short and have low entropy | |
| under the trained model. | |
| - **`sdpo=0.0000` on every step.** Channel 2 has auto-disabled because the | |
| default `build_batch` does not include `ctx_teacher_input_ids`. Compare | |
| the conditional in `compose_loss`: | |
| ```python | |
| if (alpha_sdpo > 0.0 | |
| and "ctx_teacher_input_ids" in inputs | |
| and inputs["ctx_teacher_input_ids"].numel() > 0): | |
| ``` | |
| — channel auto-off if either the weight or the inputs are missing. | |
| - **`dpo > 0` and trending down.** The batch *does* include | |
| `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, | |
| `dpo_chosen_ref_logprobs` (and the rejected counterparts), so channel 3 | |
| is live. | |
| - **`finite=True`** — every step's `p.grad` was finite for every parameter. | |
| This is the wiring contract; if it ever flips to `False` the smoke fails. | |
| If you see `Verdict: PASS`, the framework is correctly installed and | |
| gradients flow through all live channels. You are ready for section 4. | |
| --- | |
| ## 4. Adding the trace-replay channel | |
| The quickstart batch *had* DPO inputs, but they were synthetic — the | |
| `build_batch` helper bakes them in. To get **real** DPO pairs from | |
| multi-teacher disagreement, use the replaysim package. | |
| ### 4a. Spin up `replay_trace` | |
| ```python | |
| import asyncio | |
| from composer_replication import ( | |
| DEFAULT_TEACHERS, replay_trace, extract_dpo_pairs, | |
| ) | |
| # Trace must be a list[TraceState]; see composer_replication/teacher_replay.py | |
| # for the exact TypedDict shape. Each state holds a chat-messages prefix + | |
| # the student's actual action at that step. | |
| states = [...] # your frozen agentic trace; see spike 001 for a 50-step example | |
| teacher_actions = asyncio.run( | |
| replay_trace( | |
| states=states, | |
| teachers=DEFAULT_TEACHERS, # claude-opus-4.7 + gpt-5 + deepseek-v4-pro | |
| max_total_usd=10.0, # hard ceiling (spike 001 measured $0.98/trace mean) | |
| ) | |
| ) | |
| ``` | |
| The 3 teachers are queried in parallel via OpenRouter | |
| (`OPENROUTER_API_KEY` in env or `~/.hermes/.env`), latencies recorded, | |
| costs tracked. | |
| ### 4b. Get `DPOPair`s from disagreement | |
| ```python | |
| pairs = extract_dpo_pairs( | |
| states=states, | |
| teacher_actions=teacher_actions, | |
| agreement_threshold=2, # at least 2/3 teachers must agree on the chosen action | |
| ) | |
| ``` | |
| Each pair is a `DPOPair` TypedDict with the exact shape the | |
| `DJNormalizer` and downstream training expects: | |
| ```python | |
| class DPOPair(TypedDict): | |
| state_id: str | |
| state_messages: list[dict] # conversation context | |
| chosen: str # teacher-consensus action | |
| rejected: str # student action | |
| n_teachers_agreeing: int | |
| ``` | |
| (verified in `composer_replication/teacher_replay.py:99–105`). | |
| ### 4c. Run `DJNormalizer` with `default.yaml` | |
| ```python | |
| from composer_replication.replaysim import DJNormalizer | |
| normalizer = DJNormalizer() # uses recipes/replaysim/default.yaml | |
| normalized = normalizer.normalize(pairs) | |
| # → list[NormalizedDPOPair] | |
| ``` | |
| `DJNormalizer` shells out to data-juicer's `DefaultExecutor` under the hood | |
| (file-in / file-out contract). The default recipe at | |
| `composer_replication/recipes/replaysim/default.yaml` runs four CPU-only ops | |
| in order: | |
| 1. `text_length_filter` (8 ≤ chars ≤ 32000) on `chosen` and `rejected` | |
| 2. `words_num_filter` (2 ≤ words ≤ 4096) on both | |
| 3. `special_characters_filter` (≤50% non-alpha) on both | |
| 4. `document_deduplicator` (per-batch hashing, lowercase, ignore non-character) on `chosen` | |
| Records carry **two parallel shapes** for `chosen`/`rejected`: | |
| - flat strings (`chosen`, `rejected`) → consumed by data-juicer's text_key-based filters | |
| - chat-messages lists (`chosen_messages`, `rejected_messages`) → preserved for chat-aware ops + round-trip | |
| This dual-shape design (verified in the test | |
| `test_dpo_pair_to_dj_record_shape`, | |
| `composer_replication/replaysim/tests/test_replaysim.py:44`) is what | |
| unblocked the data-juicer integration in Wave 14. | |
| ### 4d. The 3-record fixture from spike 001 | |
| The fixture lives at | |
| `spikes/001-teacher-replay-cost/states.jsonl` (50 states) and | |
| `spikes/001-teacher-replay-cost/results.jsonl` (the teacher responses, all | |
| priced and timed). The first 3 states are: | |
| ```jsonl | |
| {"id": "state-000", "task": "Fix the failing test in tests/test_auth.py::test_login_with_email", ...} | |
| {"id": "state-001", "task": "Add rate-limiting middleware to the Flask app", ...} | |
| {"id": "state-002", "task": "Refactor the parse_config function — it's 200 lines and has 3 responsibilities", ...} | |
| ``` | |
| For each, all 3 teachers answered (claude-opus-4.7, gpt-5, deepseek-v4-pro); | |
| agreement on the `(c)` choice for state-000 and state-001 (read more | |
| files / check schema first) drives a clean DPO pair where the student's | |
| action becomes the rejected. For state-002, all 3 agreed on `(c)` (write | |
| characterization tests first) → another clean pair. These three records | |
| pass through the `DJNormalizer` default recipe unchanged (length, words, | |
| special-char ratios all in bounds; no duplicates). | |
| The full 50-state trace cost **$0.98 mean** end-to-end across all three | |
| teachers (spike 001 verdict). The framework's cost ceiling | |
| (`max_total_usd`) and VOI gating drop this to ~$0.30/trace projected. | |
| ### 4e. End-to-end one-liner | |
| ```python | |
| from composer_replication.replaysim import replay_and_normalize_trace | |
| teacher_actions, normalized_pairs = await replay_and_normalize_trace( | |
| states=states, | |
| teachers=DEFAULT_TEACHERS, | |
| agreement_threshold=2, | |
| max_total_usd=10.0, | |
| ) | |
| ``` | |
| (`async def`; for sync callers use the sibling `replay_and_normalize_trace_sync` | |
| in `composer_replication.replaysim.normalize`.) | |
| --- | |
| ## 5. Switching DPO → SimPO: one kwarg | |
| ```python | |
| components = compose_loss( | |
| model, batch, | |
| alpha_sdpo=0.1, | |
| beta_replay=0.05, | |
| dpo_variant="simpo", # ← the only line that changes | |
| simpo_beta=2.0, # paper default | |
| simpo_gamma=1.0, # paper default | |
| ) | |
| ``` | |
| The kwarg is verified in `compose_loss`'s signature | |
| (`composer_replication/loss.py:81`): | |
| ```python | |
| dpo_variant: Literal["dpo", "simpo"] = "dpo", | |
| ``` | |
| ### What changes in the loss curve | |
| - **Channel 3 input requirements drop.** `compose_loss` no longer reads | |
| `dpo_chosen_ref_logprobs` / `dpo_rejected_ref_logprobs`. Reference-model | |
| VRAM cost goes to zero. (Source: `composer_replication/loss.py:111–113` | |
| and `composer_replication/distillation/simpo.py:23–27`.) | |
| - **Loss scale shifts.** Standard DPO is | |
| `-logsigmoid(β·[(logπ(c) - logπ_ref(c)) - (logπ(r) - logπ_ref(r))])`. | |
| SimPO is `-logsigmoid(β·[avg_logπ(c) - avg_logπ(r)] - γ)` — average | |
| per-token log-prob (length-normalized) and a constant target margin γ. | |
| - **Loss is ≤ DPO loss when chosen/rejected separation is large.** The | |
| unit test `test_simpo_loss_lower_for_better_separation` | |
| (`composer_replication/distillation/tests/test_distillation_losses.py:35`) | |
| verifies that a wider chosen-vs-rejected gap drives lower SimPO loss — | |
| meaning, in practice, SimPO curves are *steeper* than DPO when the | |
| preference signal is strong, and *flatter* when it's weak. | |
| - **No KL-against-reference regularization.** This is both the upside (no | |
| ref-model serving) and the risk (more tendency to drift). Watch for | |
| reward-hacking-style degeneracies if your preference data has noise. | |
| ### When to use SimPO | |
| - **GPU-poor.** You can't afford to keep a frozen reference policy resident | |
| alongside the trainer. | |
| - **Cold-start preference data.** Length-normalization (avg_logπ vs sum) | |
| helps when chosen/rejected lengths are wildly imbalanced — common in | |
| agentic traces where the student's failed attempt is short and the | |
| teacher's correction is long. | |
| - **You don't have ref logprobs precomputed.** SimPO needs nothing from | |
| the reference policy. | |
| When to **stay on DPO**: when you need the explicit KL anchor against | |
| a known-good reference (e.g., when training over a long horizon and you | |
| want to bound the drift), or when your preference data is very noisy and | |
| the reference acts as a regularizer. | |
| --- | |
| ## 6. Adding TAID / Entropy-Aware OPD wrappers | |
| Channel 2 (SDPO/OPSD) can be replaced by **TAID** (Sakana AI, | |
| arXiv:2501.16937) for capacity-gap distillation, or by | |
| **Entropy-Aware OPD** (ICLR 2026 Spotlight) for per-token forward/reverse-KL | |
| gating. Both are wired through `compose_loss`: | |
| ```python | |
| sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none", | |
| taid_t: float | None = None, # current TAID interpolation coeff | |
| entropy_opd_h_max: float | None = None, | |
| ``` | |
| (verified at `composer_replication/loss.py:82–93`.) | |
| ### TAID (upstream-faithful port) | |
| > **Wave 15 rewrite, breaking change.** The previous in-tree TAID was | |
| > algorithmically different from the paper (it mixed in probability space | |
| > against a frozen step-0 student snapshot and wrapped a symmetric JSD | |
| > criterion). It has been replaced with an upstream-faithful port: | |
| > logit-space mix, current-student-detached anchor, forward-KL criterion. | |
| > Old kwargs `taid_schedule_step`, `taid_total_steps`, `taid_schedule`, | |
| > `taid_alpha_min`, `taid_alpha_max`, plus `inputs["student_init_logits"]` / | |
| > `inputs["student_init_input_ids"]` are **gone**. They have no upstream | |
| > analogue. Use `taid_t` (and optionally `TAIDScheduler`) instead. | |
| The TAID criterion is forward-KL against a logit-space-interpolated target: | |
| ``` | |
| p_t = softmax( (1 - t) · stop_grad(student_logits) + t · teacher_logits ) | |
| L = - mean_token Σ_v p_t(v) · log_softmax(student_logits)(v) | |
| ``` | |
| where `t ∈ [0, 1]` is the interpolation coefficient. At `t=0` the target | |
| is the (detached) student itself — the loss is the entropy of that | |
| distribution and contributes no gradient to the student. At `t=1` it | |
| reduces to standard forward-KL distillation against the teacher. | |
| The schedule that produces `t` is the **trainer's** responsibility. The | |
| package ships an optional `TAIDScheduler` that mirrors the paper's | |
| adaptive momentum scheme: | |
| ```python | |
| from composer_replication.distillation import TAIDScheduler | |
| sched = TAIDScheduler(num_train_steps=10_000) # paper defaults | |
| for step in range(num_train_steps): | |
| components = compose_loss( | |
| model, batch, | |
| sdpo_wrapper="taid", | |
| taid_t=sched.t, | |
| ) | |
| components.total.backward(); optimizer.step() | |
| sched.update_t(components.sdpo_jsd.detach(), global_step=step) | |
| ``` | |
| `TAIDScheduler` defaults match upstream: `t_start=0.4`, `t_end=1.0`, | |
| `alpha=5e-4`, `beta=0.99`. Pass `disable_adaptive=True` to fall back to | |
| the deterministic linear schedule | |
| `t = t_start + progress · (t_end - t_start)`. | |
| If you want a simple fixed schedule (no scheduler), just compute `t` | |
| yourself and pass it in — `compose_loss` validates `taid_t ∈ [0, 1]`. | |
| ### Upstream-parity test | |
| `composer_replication/distillation/tests/test_taid_parity.py` skip-imports | |
| the upstream reference at `/tmp/taid-clone` (clone with | |
| `git clone --depth 1 https://github.com/SakanaAI/TAID /tmp/taid-clone`) | |
| and asserts our `taid_loss(student, teacher, mask, t)` matches upstream | |
| `TAID.compute_loss(...)` within `atol=rtol=1e-5` across `t ∈ {0.0, 0.1, 0.4, | |
| 0.5, 0.9, 1.0}`. This is the load-bearing parity guarantee. | |
| ### Entropy-Aware OPD | |
| Drop-in for channel 2 — gates between forward KL (mode-covering) and | |
| reverse KL (mode-seeking) per token, weighted by the teacher's entropy: | |
| ``` | |
| L = Σ_t w(t) · KL_fwd_t + (1 - w(t)) · KL_rev_t | |
| w(t) = clamp(H_teacher(t) / h_max, 0, 1) | |
| ``` | |
| `entropy_opd_h_max=None` (the default) auto-sets `h_max = log(V)` (the | |
| maximum-entropy bound for a vocab-V softmax). | |
| ### Boundary-condition unit test (proof of correctness) | |
| The test `test_taid_loss_t_zero_target_matches_detached_student` | |
| (`composer_replication/distillation/tests/test_distillation_losses.py`) | |
| pins TAID's `t=0` invariant — the teacher is *completely* hidden from the | |
| gradient because the target collapses to `softmax(student.detach())`: | |
| ```python | |
| def test_taid_loss_t_zero_target_matches_detached_student(): | |
| s1 = torch.randn(1, 2, 4, requires_grad=True) | |
| teacher_a = torch.zeros(1, 2, 4); teacher_a[..., 0] = 10.0 | |
| teacher_b = torch.zeros(1, 2, 4); teacher_b[..., 3] = 10.0 | |
| mask = torch.ones(1, 2) | |
| loss_a = taid_loss(s1, teacher_a, mask, t=0.0) | |
| loss_b = taid_loss(s1, teacher_b, mask, t=0.0) | |
| # Two completely different teachers must give the same loss at t=0. | |
| assert abs(float(loss_a) - float(loss_b)) < 1e-6 | |
| ``` | |
| This is the load-bearing test for TAID: if the `t=0` endpoint ever leaks | |
| teacher signal into the gradient, this test fires and the contract is | |
| broken. The companion test `test_taid_loss_t_one_is_pure_forward_kl` | |
| pins the `t=1` endpoint by hand-computing `-Σ p_teacher · log_q` and | |
| asserting equality. | |
| --- | |
| ## 7. Going multi-replica with serverless DiLoCo | |
| DiLoCo is the outer-loop optimizer that lets you run N replicas in | |
| parallel, sync them every H inner steps, and tolerate slow links — see | |
| `docs/adrs/ADR-005-serverless-diloco.md` for the design. The framework | |
| gives you three increasingly-distant deployments: | |
| ### Step 1 — `LocalProcessExecutor` for development | |
| ```python | |
| from composer_replication.diloco.serverless import ( | |
| LocalProcessExecutor, ObjectStoreAllReduce, | |
| ) | |
| import tempfile | |
| with tempfile.TemporaryDirectory() as td: | |
| rendezvous = ObjectStoreAllReduce(td, rank=0, world_size=4) | |
| executor = LocalProcessExecutor() | |
| handles = executor.launch_replicas( | |
| n_replicas=4, | |
| entrypoint="composer_replication.diloco.serverless.replica_entrypoint", | |
| entrypoint_args={"rendezvous_uri": td, "rank_env": "REPLICA_RANK"}, | |
| ) | |
| results = executor.collect(handles, timeout=600) | |
| ``` | |
| `LocalProcessExecutor` (`composer_replication/diloco/serverless/executor.py:160`) | |
| spawns N child processes via `multiprocessing.get_context("spawn")` and | |
| sets `REPLICA_RANK={0..N-1}` in each child's env. It satisfies the | |
| `ServerlessExecutor` Protocol (line 35) — the same Protocol the cloud | |
| adapters implement. So the dev-loop code is byte-identical to the cloud | |
| deploy: only the executor instance changes. | |
| ### Step 2 — `ObjectStoreAllReduce` as the rendezvous | |
| ```python | |
| # Local file:// for tests | |
| rendezvous = ObjectStoreAllReduce("/tmp/diloco-runs/run42/", rank=0, world_size=4) | |
| # Real S3 (after `pip install -e .[serverless]`) | |
| rendezvous = ObjectStoreAllReduce( | |
| "s3://my-bucket/diloco-runs/run42/", | |
| rank=0, world_size=4, | |
| timeout_s=1800.0, | |
| ) | |
| ``` | |
| The communication pattern is `S3 PutObject + N GetObjects` once per | |
| inner H steps (matches DiLoCo's actual sync cadence, | |
| arXiv:2311.08105 §3.2). For 1B-param bf16, that's ~2 GB / 30 minutes | |
| per replica — well within S3 free-tier. On the inner side the framework | |
| exposes a `MockManager` that drops into the `torchft.Manager` slot, so | |
| you can validate the rendezvous logic before plugging in real torchft | |
| (verified by `test_serverless_diloco_integration.py`). | |
| ### Step 3 — point at `ModalExecutor` / `HFJobsExecutor` | |
| ```python | |
| # Modal (skeleton at composer_replication/diloco/serverless/modal.py) | |
| from composer_replication.diloco.serverless.modal import ModalExecutor | |
| executor = ModalExecutor(image="modal:python3.11", gpu="A100") | |
| # HuggingFace Jobs (skeleton at composer_replication/diloco/serverless/hf_jobs.py) | |
| from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor | |
| executor = HFJobsExecutor(hardware="a10g-large") | |
| # Same Protocol — same launch_replicas / poll / collect calls as Local | |
| handles = executor.launch_replicas(n_replicas=4, ...) | |
| ``` | |
| Both adapters check their cloud SDK at `__init__` time (not at module | |
| import) so they don't break the package if you don't have `modal` or | |
| `huggingface_hub` installed. Production maturity: dev-ready for cloud | |
| trial; per ADR-005, full HA-cluster fan-out lives in v0.2+. | |
| --- | |
| ## 8. Picking an RL backend | |
| Four canonical recipes, each tied to an upstream framework. Source: | |
| `docs/INTEGRATION_ARCHITECTURE.md` Recipes A–D plus | |
| `docs/adrs/ADR-006-rl-frameworks.md`. | |
| ### Recipe A — TRL `GRPOTrainer` subclass | |
| `ComposerReplicationTrainer` is a `trl.GRPOTrainer` subclass that | |
| overrides `_compute_loss(model, inputs)` to compose the same 3 channels | |
| on top of TRL's real reward + advantage machinery. Install: | |
| `pip install -e ".[train]"`. Then: | |
| ```python | |
| from composer_replication import ComposerReplicationTrainer | |
| trainer = ComposerReplicationTrainer(model=..., reward_funcs=[...], ...) | |
| trainer.train() | |
| ``` | |
| **When to use it:** This is the v0.0/v0.1 recommended path. You want | |
| real GRPO with rewards, you have HF model + dataset + (mostly) standard | |
| GRPO infrastructure, and you don't need >100B-param scale. TRL is | |
| mature, the trainer is a small subclass, and the same `compose_loss` | |
| math runs in both the verification harness and in production with no | |
| re-coding. | |
| → See `docs/INTEGRATION_ARCHITECTURE.md` § "Recipe A: TRL `GRPOTrainer` | |
| subclass" (line 205). | |
| ### Recipe B — VeRL custom `adv_estimator` + DataProto extension | |
| VeRL replaces TRL's reward+advantage machinery with a Ray-driven actor | |
| graph that's specifically optimized for distributed RL training of | |
| large LMs. Composition with the framework: extend `DataProto` with the | |
| hint-conditioned columns + DPO pair fields, register a custom | |
| `adv_estimator` that calls the same `compose_loss` body. | |
| **When to use it:** You're past 7B-param, you have multi-host setup | |
| (Ray cluster), and TRL's single-process trainer is the bottleneck. VeRL | |
| is the recommended scale path for v0.2+. Trade-off: the extension surface | |
| is larger and the docs are sparser than TRL's. | |
| → See `docs/INTEGRATION_ARCHITECTURE.md` § "Recipe B: VeRL custom | |
| `adv_estimator`" (line 289). | |
| ### Recipe C — PRIME-RL with DPPO-clip details | |
| `composer_replication/recipes/prime_rl/composer_loss.py` ships a | |
| `loss_fn(inputs, *, alpha_sdpo=0.0, beta_dpo=0.0, dppo_mask_high=0.2, | |
| dppo_mask_low=0.2, adv_tau=1.0, kl_tau=1e-3)` adapter that maps | |
| PRIME-RL's `LossInputs` struct (1-D per-sample tensors: | |
| `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, | |
| `advantages`, `loss_mask`) into our 3-channel composition. | |
| The DPPO+KL bit is what makes PRIME-RL distinctive — and we mirror | |
| PRIME-RL's upstream `default_loss_fn` exactly (verified against | |
| `prime_rl/trainer/rl/loss.py` lines 116-165): | |
| ```python | |
| log_ir = trainer_logprobs - inference_logprobs | |
| ir = exp(log_ir) # importance ratio | |
| probs_diff = exp(trainer_logprobs) - exp(inference_logprobs) | |
| invalid_high = probs_diff > dppo_mask_high # for positive-advantage tokens | |
| invalid_low = probs_diff < -dppo_mask_low # for negative-advantage tokens | |
| invalid = where(advantages > 0, invalid_high, invalid_low) | |
| keep = loss_mask & ~invalid | |
| pg_loss = keep * (adv_tau * advantages) * ir | |
| kl_loss = loss_mask * log_ir**2 | |
| loss = (-pg_loss + kl_tau * kl_loss).sum() | |
| ``` | |
| Three things to remember: (1) the mask gate is on **probability-space** | |
| `exp(trainer_lp) - exp(inference_lp)`, not on the log-ratio; (2) the | |
| policy-gradient term is multiplied by the importance ratio | |
| `exp(trainer_lp - inference_lp)`, not by `trainer_lp` directly (proper | |
| IS-corrected gradient, not REINFORCE); (3) the mask is **conditioned on | |
| the sign of the advantage** — positive-advantage tokens are dropped on | |
| the upper bound, negative-advantage tokens on the lower. Defaults | |
| `dppo_mask_high=dppo_mask_low=0.2` and `adv_tau=1.0, kl_tau=1e-3` | |
| match PRIME-RL's `DefaultLossConfig` (all fields `Field(..., ge=0)`). | |
| SDPO (channel 2) is gated `NotImplementedError` in v0 because PRIME-RL | |
| exposes log-probs, not full vocab logits; channel 3 (trace-replay DPO) | |
| emits a warning if `beta_dpo != 0`. | |
| **When to use it:** You're already in the PRIME-Intellect / decentralized | |
| training universe, you want INTELLECT-style scaling on a long-horizon | |
| training run, and DPPO masking is part of your existing reward+advantage | |
| recipe. Install: `pip install -e ".[prime-rl]"`. | |
| → See `composer_replication/recipes/prime_rl/prime_rl_recipe.md` and | |
| `docs/INTEGRATION_ARCHITECTURE.md` § "Recipe C: TorchForge + Monarch" | |
| (line 356). | |
| ### Recipe C+D — Monarch as actor mesh | |
| Monarch (the actor framework underpinning TorchForge) hosts the | |
| trainer/generator/manager actors in a topology-aware mesh. The framework | |
| ships *skeleton* actor definitions at | |
| `composer_replication/recipes/monarch/actors.py` (TrainerActor, | |
| GeneratorActor) and a layout doc at `monarch_actor_layout.md`. v0 | |
| intentionally *fails fast* if you try to instantiate them | |
| (`raise NotImplementedError("v0 skeleton; deferred to v0.2 per ADR-006")`) | |
| because the upstream Monarch API is still moving. | |
| **When to use it:** Reference-pattern reading only in v0. Decision point | |
| is v0.2 once the upstream actor API stabilizes. Treat the skeleton as | |
| shape-of-the-final-answer documentation, not as a production target. | |
| Install: `pip install -e ".[prime-rl,monarch]"` for the full surface. | |
| → See `composer_replication/recipes/monarch/monarch_actor_layout.md` | |
| and `docs/adrs/ADR-006-rl-frameworks.md`. | |
| --- | |
| ## Common pitfalls + what tests catch them | |
| The framework's 115-test suite (post-Wave-15) is structured so each pitfall has a | |
| specific test-file home. If you hit one of these in production, the | |
| corresponding test is your fastest reproducer. | |
| | Pitfall | Symptom | Test file (catches it) | | |
| |-----------------------------------------------------------------------------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| | |
| | Forgetting `taid_schedule_step` when `sdpo_wrapper="taid"` | `ValueError` at first step | `composer_replication/tests/test_compose_loss_integration.py` (kwarg validation) | | |
| | TAID α=0 endpoint leaks teacher signal | Teacher swap changes the loss when α should be 0 | `test_taid_loss_alpha_zero_ignores_teacher` in `composer_replication/distillation/tests/test_distillation_losses.py:153` | | |
| | TAID α=1 endpoint differs from plain SDPO | Bit-difference vs reference SDPO at the schedule end | `test_taid_blended_logits_endpoints` in `composer_replication/distillation/tests/test_distillation_losses.py:115` | | |
| | SimPO loss not differentiable through the loss-of-sigmoid path | `chosen.grad is None` after backward | `test_simpo_loss_differentiable` in `composer_replication/distillation/tests/test_distillation_losses.py:50` | | |
| | SimPO shape-mismatch slips through silently | Broadcasting bug, NaN downstream | `test_simpo_loss_shape_mismatch_raises` in `composer_replication/distillation/tests/test_distillation_losses.py:61` | | |
| | Entropy-OPD failing to zero out when distributions match | Loss > 0 when student≡teacher | `test_entropy_aware_opd_zero_when_distributions_match` in `composer_replication/distillation/tests/test_distillation_losses.py:217` | | |
| | Entropy of one-hot ≠ 0 / entropy of uniform ≠ log(V) | Wrong gating weights w(t) | `test_teacher_entropy_one_hot_is_zero` and `test_teacher_entropy_uniform_is_log_v` in `composer_replication/distillation/tests/test_distillation_losses.py:175,183` | | |
| | `DJNormalizer` records missing the chat-messages shape | Filters silently no-op or crash | `test_dpo_pair_to_dj_record_shape` in `composer_replication/replaysim/tests/test_replaysim.py:44` | | |
| | `DJNormalizer` round-trip drops `state_messages` / metadata | Lost provenance | `test_dj_record_to_normalized_roundtrip` and `test_dj_record_to_normalized_preserves_state_messages` in `composer_replication/replaysim/tests/test_replaysim.py` | | |
| | `ObjectStoreAllReduce` accepts an out-of-bounds rank | Silent corruption of the all-reduce average | `test_object_store_allreduce_init_validates_rank` in `composer_replication/diloco/serverless/tests/test_serverless_local.py:31` | | |
| | `ObjectStoreAllReduce(world_size=1)` doesn't passthrough cleanly | False all-reduce on single replica | `test_object_store_allreduce_world_size_1_passthrough` in `composer_replication/diloco/serverless/tests/test_serverless_local.py:46` | | |
| | `LocalProcessExecutor` doesn't propagate child failures to `collect()` | Silent test pass when a replica crashed | `test_serverless_diloco_integration.py` in `composer_replication/diloco/serverless/tests/` | | |
| | PRIME-RL adapter accidentally uses `(B, T)` shape instead of per-sample `(seq,)` | Shape mismatch / wrong reduction | `composer_replication/recipes/prime_rl/tests/test_composer_loss.py` (10 tests covering shape and DPPO mask edges) | | |
| | Channel 2/3 fails to auto-disable when its inputs are absent | Crash on missing key, not graceful skip | `composer_replication/tests/test_compose_loss_integration.py` (`(a) defaults reproduce existing compose_loss output bit-exact`) | | |
| Run the full suite with `pytest` from the repo root. | |
| --- | |
| **File path:** `/mnt/e/CS/HF/composer-replication-framework/docs/USER_GUIDE.md` | |