Multi-input affine experiment (control arm — 1 slot, no affine)

Continued AV-SFT warm-started from syvb/nanonla-qwen3-8b-L24-av (Qwen3-8B, injection layer 24, d_model 4096). Both arms were trained with the standalone (miles-free) trainer for 3000 steps, eff. batch 32, lr 2e-5 cosine→2e-6, warmup 50, injection_scale = sqrt(d_model), single seed.

Experiment arm: 16 repeated injection markers, each with its own learned affine A_i·v_norm + b_i over the normalized injected activation (identity-init, fp32, +269M params). Affines and backbone trained.
Control arm: 1 marker, no affine (status quo). Same warm-start, data, and budget.
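
A minimal NumPy sketch of what the experiment arm computes per slot, for orientation only. It is not the trainer's code: the function names are hypothetical, "normalized" is assumed to mean unit L2 norm, and injection_scale is assumed to multiply the affine output.

```python
import math
import numpy as np

def init_slot_affines(n_slots: int, d_model: int):
    """Identity-initialised per-slot affines, kept in fp32."""
    weight = np.tile(np.eye(d_model, dtype=np.float32), (n_slots, 1, 1))  # [n_slots, d, d]
    bias = np.zeros((n_slots, d_model), dtype=np.float32)                 # [n_slots, d]
    return weight, bias

def slot_inputs(v: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Return one injected vector per slot: scale * (A_i @ v_norm + b_i)."""
    v32 = v.astype(np.float32)
    # Assumption: the normalized activation has unit L2 norm.
    v_norm = v32 / np.linalg.norm(v32)
    out = np.einsum("sij,j->si", weight, v_norm) + bias  # [n_slots, d]
    # Assumption: injection_scale = sqrt(d_model) is applied after the affine.
    return math.sqrt(v.shape[-1]) * out

def extra_params(n_slots: int, d_model: int) -> int:
    """Parameters added by the affines: one d x d matrix plus one bias per slot."""
    return n_slots * (d_model * d_model + d_model)
```

At identity init every slot receives the same vector, so the experiment arm starts out as pure 16× repetition. The parameter count for 16 slots at d_model 4096 is 16 × (4096² + 4096) = 268,500,992, which matches the +269M figure above.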

Result — held-out NLL (lower is better)

Evaluated on 832 document-disjoint held-out rows (val rows whose doc_id does not appear in the training split), gold activation injected, paired (same rows, both arms), token-NLL on the response.
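
The document-disjoint filter described above could look like the following sketch. The row format (dicts carrying a "doc_id" key) and the function name are assumptions for illustration, not the actual data loader.

```python
def document_disjoint_rows(train_rows, val_rows):
    """Keep only val rows whose doc_id never appears in the training split."""
    train_docs = {row["doc_id"] for row in train_rows}
    return [row for row in val_rows if row["doc_id"] not in train_docs]
```

Filtering by doc_id rather than by row guards against leakage when several rows are drawn from the same source document.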

arm	val NLL/token	perplexity
16-slot affine	1.4141	4.113
1-slot control	1.4305	4.181
Δ (control − experiment)	+0.0165	—
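
As a quick consistency check on the table, perplexity is exp(NLL/token), and a gap of Δ nats/token corresponds to a perplexity ratio of exp(−Δ). The snippet only re-derives numbers from the rounded values reported above; the variable names are illustrative.

```python
import math

nll_affine, nll_control = 1.4141, 1.4305   # val NLL/token from the table
delta = 0.0165                             # reported paired difference (control - experiment)

ppl_affine = math.exp(nll_affine)          # perplexity of the 16-slot affine arm
ppl_control = math.exp(nll_control)        # perplexity of the 1-slot control arm

rel_nll_gap = delta / nll_control          # gap as a fraction of the control NLL
ppl_ratio = math.exp(-delta)               # experiment perplexity / control perplexity

print(f"ppl affine={ppl_affine:.3f} control={ppl_control:.3f}")
print(f"relative NLL gap={rel_nll_gap:.2%}, perplexity ratio={ppl_ratio:.4f}")
```

The rounded table entries differ by 0.0164 rather than 0.0165; the reported Δ is presumably computed from unrounded per-row values.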

Paired row-level bootstrap (10k resamples): Δ = +0.0165 nats/token, 95% CI [+0.0148, +0.0181] (excludes 0).
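
A paired row-level bootstrap of this kind can be sketched as below. This is not the evaluation script: it assumes each row contributes one per-row mean NLL per arm and that rows are weighted equally, whereas a token-weighted variant would resample (NLL sum, token count) pairs instead. The function name and defaults are hypothetical.

```python
import numpy as np

def paired_bootstrap_ci(nll_control, nll_experiment,
                        n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(control - experiment), resampling rows with replacement."""
    # Pairing: take the per-row difference first, so both arms share each resampled row.
    diff = np.asarray(nll_control, dtype=float) - np.asarray(nll_experiment, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lo, hi
```

Because rows are resampled but the two trained models are fixed, the interval captures eval-row noise only, not variation across training runs.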

Honest interpretation (read this)

This is a small, consistent, system-level improvement — not a clean demonstration that "the affine improves verbalization." Specifically:

Effect size is small: +0.0165 nats/token ≈ 1.1% of the control's NLL (ppl 4.18 → 4.11, about 1.6% lower perplexity).
Repetition is confounded with the affine. The experiment changes two things at once vs. control: 16× marker repetition and the learned affine. This run cannot separate them. Inspection of the trained affine shows each slot moved only ~4–6% (Frobenius) from identity — so much of the behavior is still "inject ~v_norm into 16 slots" (pure repetition), with the affine a small perturbation on top. A "16 markers, no affine" arm is required to isolate the affine's contribution.
The slots did learn distinct transforms: the across-slot divergence ‖A_i − Ā‖ is nearly as large as each slot's deviation from identity, i.e. the 16 affines are mostly unique rather than a shared shift. This is partial support for the "independent views" idea, but not proof that it is what helps.
Single training seed. The CI reflects eval-row noise only, not training variability; re-training both arms with a different seed could shift the gap. A +1% gap is within the plausible seed-to-seed range for 2.1-epoch continued SFT on an 8B model.
NLL is a proxy. This measures teacher-forced response NLL, not the NLA paper's verbalization/reconstruction-quality metrics (FVE, downstream evals).
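
The two affine diagnostics mentioned in the list above (deviation from identity and across-slot divergence) can be computed roughly as follows. The normalisation by ‖I‖_F = sqrt(d_model) is an assumption about how the ~4–6% figure was obtained, and the function name is hypothetical.

```python
import numpy as np

def affine_drift(weight: np.ndarray):
    """Per-slot Frobenius drift of a [n_slots, d, d] weight stack, relative to ||I||_F."""
    d = weight.shape[-1]
    eye = np.eye(d, dtype=weight.dtype)
    eye_norm = np.linalg.norm(eye)  # Frobenius norm of the identity, sqrt(d)
    # Deviation of each slot from its identity initialisation.
    from_identity = np.linalg.norm(weight - eye, axis=(1, 2)) / eye_norm
    # Deviation of each slot from the mean matrix across slots.
    from_mean = np.linalg.norm(weight - weight.mean(axis=0), axis=(1, 2)) / eye_norm
    return from_identity, from_mean
```

If all slots had learned the same shift, from_mean would be near zero while from_identity stayed large; comparable magnitudes for both indicate that the slots diverged from each other.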

For the 16-slot model the per-slot affines are in nla_affine.safetensors (weight [16,4096,4096], bias [16,4096]); apply at injection time — see launch/eval_av_val_loss.py --multi-input-slots 16 --affine-path ....

wandb: https://wandb.ai/octahedral-systems/nla-multi-affine-experiment