Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Backlog — Composer 2.5 Replication Framework
Imported from docs/VISION_VALIDATION.md § 6 (gaps) + § 9 (gap-closers) at 2026-05-26.
Active items (CPU-only, no GPU budget)
Spike 006 — Real HF model smoke (Wave 7)
Closes: V8 ("any HF model") — currently we run only mock 4-layer toy LM through composer_total_loss.
Goal: prove the 3-channel loss (grpo + α·sdpo_kl + β·trace_replay_dpo) survives a real transformers model + tokenizer with finite gradients and a decreasing loss across N steps.
Acceptance:
AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")loads on CPU.- Real tokenizer
apply_chat_templateproducesinput_idsshape that flows throughcomposer_total_loss(model, batch)without mock shapes. - 5 backward steps run on CPU without
nan/inf/ shape mismatch. - Loss is monotone non-increasing across 5 steps (trend; allow noise).
- New tests added under
spikes/006-real-hf-model-smoke/tests/pass alongside existing 38.
Estimate: half a day, CPU only.
Spike 007 — Real trace ingestion (Wave 8)
Closes: V5 ("real LLM-application traces") — Spike 001 used 50 hand-crafted states. Brief said "real traces."
Goal: pick ONE real agent-session log format with stable, public schema, write a TraceIngester that converts it to our TraceExample dataclass, run end-to-end through the data collator + a trimmed cost-floor measurement on 5 real states.
Acceptance:
- ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
TraceIngester.ingest(path: Path) -> Iterator[TraceExample]is implemented + has unit tests with a fixture log file.- End-to-end smoke: real trace → ingester → collator → 1-step
composer_total_lossruns without error. - Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to
spikes/007-*/verdict.md.
Estimate: 1 day + ~$2 OpenRouter.
Spike 008 — Streaming DiLoCo smoke (Wave 9)
Closes: V2 (DiLoCo "deferred to v0.2" — drift from original brief).
Goal: bolt outer-loop pseudo-gradient sync onto the loss composition test using two nn.Module replicas on the same node. No real distributed training (CPU multiprocessing or single-process).
Acceptance:
- ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from PrimeIntellect / Async-DiLoCo).
outer_optimizer.pyimplements pseudo-gradient = (θ_local − θ_initial), Nesterov-momentum outer step.- Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005, both replicas converge toward the same solution within tolerance.
- 38 existing tests still pass (no regression).
Estimate: 2 days, CPU.
Wave 10 — Packaging
Closes: V4 ("skeleton not framework").
Goal: turn the assemblage of spike directories into an installable Python package with a clear quickstart.
Acceptance:
pyproject.tomlat repo root, package namecomposer_replication.composer_replication/dir with__init__.pyre-exportingcomposer_total_loss,OPSDLoss,TeacherReplayBuffer,compose_loss,TraceIngester, etc.examples/qwen3_05b_quickstart/with end-to-end script that loads model, runs 10 training steps, prints loss curve.- README quickstart updated to
pip install -e .+python examples/qwen3_05b_quickstart/run.py. pip install -e .succeeds and quickstart runs end-to-end on CPU.
Estimate: half a day, CPU.
Modal-gated (if budget allows after gap-closers)
Spike 002a-mini — Real GPU smoke (Phase 10)
Closes: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.
Goal: dispatch a 30-min A10G smoke on Modal that runs Spike 006 unchanged on GPU, verifies bf16 numerics, captures memory + step-time.
Acceptance:
- ADR-001 says Modal is the right choice for this workload + estimate is < $5.
- Modal app builds, runs
composer_total_lossfor 50 steps on Qwen2.5-0.5B-Instruct. - Loss curve + memory profile saved to
spikes/002a-mini/and pulled to local. - No new shape / dtype bug surfaced vs CPU run.
Estimate: $1–3, 30 min wall-clock.
Deferred (post-loop, GPU-gated)
- Spike 002a/002b — full trace collection on A100 ($30–50)
- Spike 003 — DPO-pair signal density study
- Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
- Publication wave — author identity, thumbnail, X tags, post sequence
Process notes
- Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
- Each spike has its own
spikes/00N-name/dir +verdict.mdrecording acceptance + delta from estimate. - Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.