Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs

	# Backlog — Composer 2.5 Replication Framework

	Imported from `docs/VISION_VALIDATION.md` § 6 (gaps) + § 9 (gap-closers) on 2026-05-26.

	## Active items (CPU-only, no GPU budget)

	### Spike 006 — Real HF model smoke (Wave 7)

	Closes: V8 ("any HF model"). Currently only a mock 4-layer toy LM runs through `composer_total_loss`.

	Goal: prove the 3-channel loss (`grpo + α·sdpo_kl + β·trace_replay_dpo`) survives a real `transformers` model + tokenizer with finite gradients and a decreasing loss across N steps.
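
As a minimal sketch of how the three channels combine, assuming each channel has already been reduced to a scalar. The function name `combine_channels` and the numeric values are hypothetical; this is not the signature of the repo's `composer_total_loss`.

```python
def combine_channels(grpo: float, sdpo_kl: float, trace_replay_dpo: float,
                     alpha: float, beta: float) -> float:
    """Weighted sum: grpo + alpha * sdpo_kl + beta * trace_replay_dpo."""
    return grpo + alpha * sdpo_kl + beta * trace_replay_dpo


# Illustrative channel values only (not measured results).
full = combine_channels(1.0, 0.5, 0.25, alpha=0.2, beta=0.4)

# alpha = beta = 0 switches both auxiliary channels off, leaving GRPO alone.
baseline = combine_channels(1.0, 0.5, 0.25, alpha=0.0, beta=0.0)
```

Setting both weights to zero gives the GRPO-only baseline, which is the ablation a later A/B comparison would need.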

	Acceptance:
	1. `AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` loads on CPU.
	2. Real tokenizer `apply_chat_template` produces `input_ids` shape that flows through `composer_total_loss(model, batch)` without mock shapes.
	3. 5 backward steps run on CPU without `nan` / `inf` / shape mismatch.
	4. Loss trends downward across the 5 steps: the final loss is below the first, with small step-to-step increases tolerated as noise.
	5. New tests added under `spikes/006-real-hf-model-smoke/tests/` pass alongside existing 38.
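
Criteria 3 and 4 can be checked mechanically once the per-step losses are collected as plain floats. The sketch below is one possible reading of "trend; allow noise": every loss is finite, no single step rises by more than a relative tolerance, and the last loss is below the first. The helper names and the 5% default tolerance are assumptions, not part of the spike.

```python
import math
from typing import Sequence


def losses_are_finite(losses: Sequence[float]) -> bool:
    """True when no loss is nan or +/-inf."""
    return all(math.isfinite(x) for x in losses)


def loss_trend_ok(losses: Sequence[float], noise_tol: float = 0.05) -> bool:
    """Downward trend with bounded per-step noise.

    noise_tol is the largest tolerated single-step increase, relative to
    the magnitude of the previous loss (assumed value, tune per model).
    """
    if len(losses) < 2 or not losses_are_finite(losses):
        return False
    for prev, cur in zip(losses, losses[1:]):
        if cur - prev > noise_tol * abs(prev):
            return False  # a jump larger than the noise allowance
    return losses[-1] < losses[0]


# Illustrative curve: one small uptick at step 3 is tolerated.
example_ok = loss_trend_ok([2.31, 2.10, 2.12, 1.95, 1.80])
```

Using the magnitude of the previous loss keeps the check meaningful if a channel drives the total loss negative.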

	Estimate: half a day, CPU only.

	### Spike 007 — Real trace ingestion (Wave 8)

	Closes: V5 ("real LLM-application traces"). Spike 001 used 50 hand-crafted states, but the brief asked for real traces.

	Goal: pick ONE real agent-session log format with a stable, public schema; write a `TraceIngester` that converts it to our `TraceExample` dataclass; then run it end-to-end through the data collator plus a trimmed cost-floor measurement on 5 real states.

	Acceptance:
	1. ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
	2. `TraceIngester.ingest(path: Path) -> Iterator[TraceExample]` is implemented + has unit tests with a fixture log file.
	3. End-to-end smoke: real trace → ingester → collator → 1-step `composer_total_loss` runs without error.
	4. Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to `spikes/007-*/verdict.md`.
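
A minimal sketch of the ingester interface named in criterion 2, assuming a JSONL log with one record per line. The `TraceExample` fields (`prompt`, `response`) and the record keys are hypothetical placeholders; the real schema depends on whichever source ADR-002 selects.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class TraceExample:
    # Hypothetical fields; the repo's real dataclass may differ.
    prompt: str
    response: str


class TraceIngester:
    """Streams TraceExample objects from a JSONL session log."""

    def ingest(self, path: Path) -> Iterator[TraceExample]:
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report the file position so a bad fixture is easy to fix.
                    raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
                yield TraceExample(prompt=record["prompt"],
                                   response=record["response"])
```

Yielding lazily means a large session log never has to fit in memory, and a unit test only needs a small fixture file written to a temporary directory.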

	Estimate: 1 day + ~$2 OpenRouter.
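
The p95 latency and cost report from criterion 4 could be computed along these lines. This is a sketch that uses the nearest-rank percentile definition; the `summarize` helper, the report keys, and the sample numbers are all placeholders rather than measured values.

```python
from typing import Dict, Sequence, Tuple


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def summarize(calls: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """calls holds one (latency_seconds, cost_usd) pair per teacher call."""
    latencies = [latency for latency, _ in calls]
    return {
        "calls": len(calls),
        "p95_latency_s": p95(latencies),
        "total_cost_usd": sum(cost for _, cost in calls),
    }


# Placeholder data: 5 states x 3 teachers = 15 calls.
sample_calls = [(1.0 + 0.1 * i, 0.01) for i in range(15)]
report = summarize(sample_calls)
```

With only 15 calls the nearest-rank p95 is simply the slowest call, so the figure is a rough ceiling rather than a stable percentile.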

	### Spike 008 — Streaming DiLoCo smoke (Wave 9)

	Closes: V2 (DiLoCo "deferred to v0.2" — drift from original brief).

	Goal: bolt outer-loop pseudo-gradient sync onto the loss composition test using two `nn.Module` replicas on the same node. No real distributed training is involved; the replicas run under CPU multiprocessing or in a single process.

	Acceptance:
	1. ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from Google DeepMind / Async-DiLoCo).
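	2. `outer_optimizer.py` implements the pseudo-gradient as (θ_initial − θ_local), averaged over replicas, so that a descent-style Nesterov-momentum outer step moves the shared parameters toward where the replicas ended up.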
	3. Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005, both replicas converge toward the same solution within tolerance.
	4. 38 existing tests still pass (no regression).
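
The outer loop in criteria 2 and 3 can be sketched in pure Python on a toy quadratic. Several things here are assumptions for illustration: plain SGD stands in for the real inner optimizer, the class name `NesterovOuterOptimizer` and the hyperparameters are placeholders, and the pseudo-gradient uses the sign convention "round-start parameters minus averaged replica parameters" so that subtracting it moves toward the replicas.

```python
import math
from typing import List

Vector = List[float]


def inner_sgd(theta: Vector, target: Vector, lr: float, steps: int) -> Vector:
    """Plain SGD on 0.5 * ||theta - target||^2 (stand-in inner optimizer)."""
    theta = list(theta)
    for _ in range(steps):
        theta = [t - lr * (t - g) for t, g in zip(theta, target)]
    return theta


class NesterovOuterOptimizer:
    """Hypothetical outer optimizer: SGD with Nesterov momentum."""

    def __init__(self, lr: float, momentum: float) -> None:
        self.lr = lr
        self.momentum = momentum
        self.velocity: Vector = []

    def step(self, theta_global: Vector, replicas: List[Vector]) -> Vector:
        n = len(replicas)
        mean_local = [sum(r[i] for r in replicas) / n
                      for i in range(len(theta_global))]
        # Pseudo-gradient: where the round started minus where replicas ended.
        pseudo_grad = [g - m for g, m in zip(theta_global, mean_local)]
        if not self.velocity:
            self.velocity = [0.0] * len(theta_global)
        self.velocity = [self.momentum * v + d
                         for v, d in zip(self.velocity, pseudo_grad)]
        # Nesterov look-ahead: step along pseudo_grad + momentum * velocity.
        return [t - self.lr * (d + self.momentum * v)
                for t, d, v in zip(theta_global, pseudo_grad, self.velocity)]


def run_smoke(outer_rounds: int = 2, inner_steps: int = 4) -> List[float]:
    """Returns the distance to the shared optimum before and after each round."""
    targets = [[1.1, -1.9], [0.9, -2.1]]  # two replicas, slightly different data
    optimum = [1.0, -2.0]                 # mean of the two targets
    theta = [0.0, 0.0]
    outer = NesterovOuterOptimizer(lr=0.7, momentum=0.9)
    distances = [math.dist(theta, optimum)]
    for _ in range(outer_rounds):
        # Every replica restarts from the shared parameters each round.
        replicas = [inner_sgd(theta, tgt, lr=0.1, steps=inner_steps)
                    for tgt in targets]
        theta = outer.step(theta, replicas)
        distances.append(math.dist(theta, optimum))
    return distances


distances = run_smoke()
```

Because each round restarts both replicas from the shared parameters, a smoke test can assert that the distance to the shared optimum shrinks after every outer round.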

	Estimate: 2 days, CPU.

	### Wave 10 — Packaging

	Closes: V4 ("skeleton not framework").

	Goal: turn the assemblage of spike directories into an installable Python package with a clear quickstart.

	Acceptance:
	1. `pyproject.toml` at repo root, package name `composer_replication`.
	2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
	3. `examples/qwen25_05b_quickstart/` with an end-to-end script that loads the model, runs 10 training steps, and prints the loss curve.
	4. README quickstart updated to `pip install -e .` + `python examples/qwen25_05b_quickstart/run.py`.
	5. `pip install -e .` succeeds and quickstart runs end-to-end on CPU.
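
A minimal `pyproject.toml` that would satisfy criterion 1 might look like the fragment below. It is a sketch under assumptions: setuptools as the build backend, a placeholder version, a guessed Python floor, and `torch` plus `transformers` as the only runtime dependencies.

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "composer_replication"
version = "0.1.0"            # placeholder
requires-python = ">=3.10"   # assumption
dependencies = [
    "torch",
    "transformers",
]

[tool.setuptools]
# List the package explicitly instead of relying on auto-discovery,
# since the repo root also holds other top-level directories such as spikes/.
packages = ["composer_replication"]
```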

	Estimate: half a day, CPU.

	## Modal-gated (if budget allows after gap-closers)

	### Spike 002a-mini — Real GPU smoke (Phase 10)

	Closes: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.

	Goal: dispatch a 30-min A10G smoke on Modal that runs Spike 006 unchanged on GPU, verifies bf16 numerics, and captures memory and step time.

	Acceptance:
	1. ADR-001 confirms that Modal is the right choice for this workload and that the cost estimate is under $5.
	2. Modal app builds, runs `composer_total_loss` for 50 steps on Qwen2.5-0.5B-Instruct.
	3. Loss curve + memory profile saved to `spikes/002a-mini/` and pulled to local.
	4. No new shape / dtype bug surfaced vs CPU run.
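
One way to make criterion 4 concrete is to compare the GPU loss curve against the CPU curve over the steps both runs share. The sketch below is an assumption about how that comparison could work, not the spike's method: the helper name `curve_mismatches` and the 2% relative tolerance are placeholders, and bf16 runs can legitimately drift further as steps accumulate.

```python
import math
from typing import List, Sequence, Tuple


def curve_mismatches(cpu: Sequence[float], gpu: Sequence[float],
                     rel_tol: float = 0.02) -> List[Tuple[int, float, float]]:
    """Returns (step, cpu_loss, gpu_loss) for steps that disagree.

    Only the overlapping prefix is compared, because the CPU run may be
    shorter than the GPU run. A non-finite loss always counts as a mismatch.
    """
    mismatches = []
    for step, (c, g) in enumerate(zip(cpu, gpu)):
        both_finite = math.isfinite(c) and math.isfinite(g)
        if not (both_finite and math.isclose(c, g, rel_tol=rel_tol)):
            mismatches.append((step, c, g))
    return mismatches


# Illustrative values only (not measured results).
clean = curve_mismatches([2.00, 1.80, 1.65], [2.01, 1.79, 1.66, 1.50])
```

An empty result means the two runs agree within tolerance on every shared step; any entry points at the first step worth inspecting for a dtype or shape problem.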

	Estimate: $1–3, 30 min wall-clock.

	## Deferred (post-loop, GPU-gated)

	- Spike 002a/002b — full trace collection on A100 ($30–50)
	- Spike 003 — DPO-pair signal density study
	- Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
	- Publication wave — author identity, thumbnail, X tags, post sequence

	## Process notes

	- Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
	- Each spike has its own `spikes/00N-name/` dir + `verdict.md` recording acceptance + delta from estimate.
	- Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.