Multi-input affine experiment (control arm: 1 slot, no affine)
Continued AV-SFT warm-started from syvb/nanonla-qwen3-8b-L24-av
(Qwen3-8B, injection layer 24, d_model 4096). Both arms trained with the
standalone (miles-free) trainer for 3000 steps, eff. batch 32, lr 2e-5 cosine→2e-6,
warmup 50, injection_scale = sqrt(d_model), single seed.
- Experiment arm: 16 repeated injection markers, each with its own learned
  affine A_i·v_norm + b_i over the normalized injected activation
  (identity-init, fp32, +269M params). Affines and backbone trained.
- Control arm: 1 marker, no affine (status quo). Same warm-start, data, and budget.
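
The experiment arm's per-slot transform can be sketched as follows. This is a minimal NumPy illustration, not the trainer's code: the function names are hypothetical, `v_norm` is assumed to be already normalized by whatever scheme the trainer uses, and the shapes follow the `weight [16,4096,4096]`, `bias [16,4096]` layout described for the saved affines.

```python
import numpy as np

def init_slot_affines(n_slots, d_model):
    """Identity-init: every slot starts as A_i = I, b_i = 0 (fp32)."""
    weight = np.tile(np.eye(d_model, dtype=np.float32), (n_slots, 1, 1))
    bias = np.zeros((n_slots, d_model), dtype=np.float32)
    return weight, bias

def apply_slot_affines(weight, bias, v_norm):
    """Return one injected vector per slot: A_i @ v_norm + b_i.

    weight: [n_slots, d_model, d_model], bias: [n_slots, d_model],
    v_norm: [d_model] (assumed already normalized upstream).
    """
    return np.einsum("sij,j->si", weight, v_norm) + bias

def affine_param_count(n_slots, d_model):
    """Parameters added by the affines: one d x d matrix plus one bias per slot."""
    return n_slots * (d_model * d_model + d_model)

if __name__ == "__main__":
    # Small dimensions for illustration; the run itself uses 16 slots, d_model 4096.
    W, b = init_slot_affines(n_slots=3, d_model=5)
    v = np.arange(5, dtype=np.float32)
    print(apply_slot_affines(W, b, v).shape)
    print(affine_param_count(16, 4096))
```

At identity init every slot receives the same vector, so the experiment arm starts out as plain 16-fold repetition of the control's injection.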
Result: held-out NLL (lower is better)
Evaluated on 832 document-disjoint held-out rows (val rows whose doc_id
does not appear in the training split), gold activation injected, paired (same
rows, both arms), token-NLL on the response.
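
The document-disjoint filter described above can be sketched in a few lines. This is an illustrative version that assumes rows are dicts carrying a `doc_id` field; the function name is hypothetical and the actual evaluation script may differ.

```python
def document_disjoint_rows(train_rows, val_rows, key="doc_id"):
    """Keep only validation rows whose document never appears in training."""
    train_docs = {row[key] for row in train_rows}
    return [row for row in val_rows if row[key] not in train_docs]

if __name__ == "__main__":
    train = [{"doc_id": "a"}, {"doc_id": "b"}]
    val = [{"doc_id": "b", "text": "seen"}, {"doc_id": "c", "text": "unseen"}]
    print(document_disjoint_rows(train, val))
```

Filtering by document rather than by row avoids scoring the model on text from documents it saw during training.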
| arm | val NLL/token | perplexity |
|---|---|---|
| 16-slot affine | 1.4141 | 4.113 |
| 1-slot control | 1.4305 | 4.181 |
| Δ (control − experiment) | +0.0165 | n/a |
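
As a quick consistency check on the table, perplexity is the exponential of the per-token NLL in nats. The snippet below recomputes it from the rounded table values; note that subtracting the rounded NLLs gives 0.0164, so the reported +0.0165 presumably comes from unrounded values.

```python
import math

def perplexity(nll_per_token):
    """Perplexity from mean token NLL measured in nats."""
    return math.exp(nll_per_token)

if __name__ == "__main__":
    ctrl_nll, exp_nll = 1.4305, 1.4141  # rounded values from the table
    print(round(perplexity(exp_nll), 3), round(perplexity(ctrl_nll), 3))
    gap = ctrl_nll - exp_nll
    print(f"gap = {gap:.4f} nats, {100 * gap / ctrl_nll:.2f}% of control NLL")
```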
Paired row-level bootstrap (10k resamples): Δ = +0.0165 nats/token, 95% CI [+0.0148, +0.0181] (excludes 0).
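
A paired row-level bootstrap of this kind can be sketched as below. It assumes per-row summed response NLLs for both arms plus per-row token counts, and it aggregates token-weighted (total NLL over total tokens); the function name is hypothetical, and the aggregation and the percentile interval are assumptions, not confirmed details of the original evaluation.

```python
import random

def paired_bootstrap_delta(ctrl_nll_sum, exp_nll_sum, n_tokens,
                           n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap the NLL/token gap (control - experiment) over rows.

    Each resample draws row indices with replacement and uses the same
    indices for both arms, which is what makes the comparison paired.
    """
    assert len(ctrl_nll_sum) == len(exp_nll_sum) == len(n_tokens)
    rng = random.Random(seed)
    n_rows = len(n_tokens)
    # Per-row difference in summed NLL between the two arms.
    diff = [c - e for c, e in zip(ctrl_nll_sum, exp_nll_sum)]

    def delta(indices):
        return sum(diff[i] for i in indices) / sum(n_tokens[i] for i in indices)

    point = delta(range(n_rows))
    stats = sorted(
        delta(rng.choices(range(n_rows), k=n_rows)) for _ in range(n_resamples)
    )
    # Percentile interval over the sorted resampled gaps.
    lo = stats[int((alpha / 2) * (n_resamples - 1))]
    hi = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return point, lo, hi
```

Because only evaluation rows are resampled, an interval from this procedure captures eval-row noise and says nothing about variation between training runs.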
Honest interpretation (read this)
This is a small, consistent, system-level improvement, not a clean demonstration that "the affine improves verbalization." Specifically:
- Effect size is small: +0.0165 nats ≈ 1.1% relative (ppl 4.18 → 4.11).
- Repetition is confounded with the affine. The experiment changes two things at once vs. control: 16× marker repetition and the learned affine. This run cannot separate them. Inspection of the trained affine shows each slot moved only ~4–6% (Frobenius) from identity, so much of the behavior is still "inject ~v_norm into 16 slots" (pure repetition), with the affine a small perturbation on top. A "16 markers, no affine" arm is required to isolate the affine's contribution.
- The slots did learn distinct transforms: across-slot divergence ‖W_i − W̄‖ is nearly as large as each slot's deviation from identity, i.e. the 16 affines are mostly unique, not a shared shift. This is partial support for the "independent views" idea, but not proof it's what helps.
- Single training seed. The CI reflects eval-row noise only, not training variability; a re-trained pair could shift. A +1% gap is within plausible seed-to-seed range for 2.1-epoch continued SFT on an 8B model.
- NLL is a proxy. This measures teacher-forced response NLL, not the NLA paper's verbalization/reconstruction-quality metrics (FVE, downstream evals).
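
The two affine diagnostics quoted in the list above (deviation from identity and across-slot divergence) can be computed along these lines. This is a sketch: the function name is hypothetical, and dividing both quantities by ‖I‖_F to express them as relative numbers is an assumption about how the "~4–6%" figure was obtained.

```python
import numpy as np

def slot_affine_stats(weight):
    """Per-slot Frobenius diagnostics for stacked matrices [n_slots, d, d].

    Returns (deviation from identity, deviation from the slot mean),
    both divided by ||I||_F = sqrt(d) so they read as relative sizes.
    """
    n_slots, d, _ = weight.shape
    eye = np.eye(d, dtype=weight.dtype)
    eye_norm = np.linalg.norm(eye)  # Frobenius norm of the identity
    dev_from_identity = np.linalg.norm(weight - eye, axis=(1, 2)) / eye_norm
    mean_weight = weight.mean(axis=0)  # W-bar, the average over slots
    dev_from_mean = np.linalg.norm(weight - mean_weight, axis=(1, 2)) / eye_norm
    return dev_from_identity, dev_from_mean

if __name__ == "__main__":
    # Toy example: one slot left at identity, one scaled by 10%.
    W = np.stack([np.eye(4), 1.1 * np.eye(4)])
    print(slot_affine_stats(W))
```

If all slots had learned the same shift, the second quantity would be near zero while the first stayed large; comparable magnitudes indicate slot-specific transforms.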
For the 16-slot model the per-slot affines are in nla_affine.safetensors
(weight [16,4096,4096], bias [16,4096]); apply them at injection time; see
launch/eval_av_val_loss.py --multi-input-slots 16 --affine-path ....
wandb: https://wandb.ai/octahedral-systems/nla-multi-affine-experiment
Model tree for syvb/nanonla-qwen3-8b-L24-av-ctrl
Base model
syvb/nanonla-qwen3-8b-L24-av