Multi-input affine experiment (control arm: 1 slot, no affine)

Continued AV-SFT warm-started from syvb/nanonla-qwen3-8b-L24-av (Qwen3-8B, injection layer 24, d_model 4096). Both arms trained with the standalone (miles-free) trainer at 3000 steps, eff. batch 32, lr 2e-5 cosine→2e-6, warmup 50, injection_scale = sqrt(d_model), single seed.
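The learning-rate schedule described above (50 warmup steps, then cosine decay from 2e-5 to 2e-6 over 3000 steps) might look roughly like the sketch below. The function name `lr_at` and the linear shape of the warmup are assumptions for illustration; the card gives only the endpoints and does not show the trainer's actual code.

```python
import math

def lr_at(step, total_steps=3000, warmup=50, peak=2e-5, floor=2e-6):
    """Hypothetical schedule: linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        # Assumed linear warmup; the card only states "warmup 50".
        return peak * (step + 1) / warmup
    # Fraction of the post-warmup budget consumed so far, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total_steps - warmup), 1.0)
    # Cosine interpolation between peak (progress = 0) and floor (progress = 1).
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate reaches its peak at the end of warmup and sits at the 2e-6 floor on the final step.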

  • Experiment arm: 16 repeated injection markers, each with its own learned affine A_i·v_norm + b_i over the normalized injected activation (identity-init, fp32, +269M params). Affines and backbone trained.
  • Control arm: 1 marker, no affine (status quo). Same warm-start, data, and budget.
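To make the "+269M params" figure and the identity initialization concrete, here is a minimal pure-Python sketch. The helper names (`affine_param_count`, `identity_affine`, `apply_affine`) are hypothetical and not taken from the trainer; the sketch takes v_norm as a given input and does not assume how the normalization itself is computed.

```python
import math

D_MODEL = 4096   # d_model from the card
N_SLOTS = 16     # number of injection markers in the experiment arm

def affine_param_count(n_slots, d_model):
    """Parameters in n_slots independent affines: one d x d weight plus one d bias each."""
    return n_slots * (d_model * d_model + d_model)

def identity_affine(d):
    """Identity-initialised affine: weight = I, bias = 0, so the output equals the input."""
    weight = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    bias = [0.0] * d
    return weight, bias

def apply_affine(weight, bias, v_norm):
    """Compute A_i . v_norm + b_i for a single slot."""
    return [sum(w * x for w, x in zip(row, v_norm)) + b
            for row, b in zip(weight, bias)]

# 16 * (4096 * 4096 + 4096) = 268,500,992, which rounds to the card's +269M.
extra_params = affine_param_count(N_SLOTS, D_MODEL)

# injection_scale = sqrt(d_model) = 64.0 for d_model = 4096.
injection_scale = math.sqrt(D_MODEL)
```

Because every slot starts at the identity, the 16-slot arm begins training by injecting the same v_norm into all 16 markers, which is relevant to the repetition confound discussed in the interpretation section.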

Result: held-out NLL (lower is better)

Evaluated on 832 document-disjoint held-out rows (val rows whose doc_id does not appear in the training split), gold activation injected, paired (same rows, both arms), token-NLL on the response.
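The document-disjoint filter described above can be sketched as follows. The row format (a dict with a `doc_id` key) and the function name are assumptions for illustration; the card does not specify the dataset schema.

```python
def document_disjoint_rows(train_rows, val_rows):
    """Keep only validation rows whose doc_id never appears in the training split."""
    train_doc_ids = {row["doc_id"] for row in train_rows}
    return [row for row in val_rows if row["doc_id"] not in train_doc_ids]

# Toy usage with hypothetical rows: only the row from document "c" survives.
train = [{"doc_id": "a"}, {"doc_id": "b"}]
val = [{"doc_id": "a"}, {"doc_id": "c"}]
held_out = document_disjoint_rows(train, val)
```

Filtering at the document level rather than the row level guards against leakage when several rows are drawn from the same source document.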

arm                        val NLL/token   perplexity
16-slot affine             1.4141          4.113
1-slot control             1.4305          4.181
Δ (control − experiment)   +0.0165         n/a
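The perplexity column follows from the NLL column via perplexity = exp(NLL per token), with NLL in nats. A small check, using only the values reported above (the function name is illustrative):

```python
import math

def perplexity(nll_per_token):
    """Perplexity is the exponential of the mean per-token NLL (in nats)."""
    return math.exp(nll_per_token)

experiment_ppl = perplexity(1.4141)  # the table reports 4.113
control_ppl = perplexity(1.4305)     # the table reports 4.181
```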

Paired row-level bootstrap (10k resamples): Δ = +0.0165 nats/token, 95% CI [+0.0148, +0.0181] (excludes 0).
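A paired row-level bootstrap of this kind can be sketched as below. This is not the evaluation script itself: the function name, the input format (per-row summed NLL plus per-row token counts), the token-weighted aggregation, and the percentile-interval method are all assumptions made for illustration.

```python
import random

def paired_bootstrap_delta(ctrl_nll_sums, expt_nll_sums, token_counts,
                           n_resamples=10_000, seed=0):
    """Return (point, lo, hi) for delta = control - experiment NLL per token.

    Rows are resampled with replacement, and the SAME resampled rows are
    used for both arms, which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(token_counts)
    total_tokens = sum(token_counts)
    point = (sum(ctrl_nll_sums) - sum(expt_nll_sums)) / total_tokens

    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        ctrl = sum(ctrl_nll_sums[i] for i in idx) / tokens
        expt = sum(expt_nll_sums[i] for i in idx) / tokens
        deltas.append(ctrl - expt)

    # Percentile interval: 2.5th and 97.5th percentiles of the resampled deltas.
    deltas.sort()
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    return point, lo, hi
```

Such an interval captures only which evaluation rows happened to be sampled; it says nothing about how much the result would move if either arm were retrained.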

Honest interpretation (read this)

This is a small, consistent, system-level improvement, not a clean demonstration that "the affine improves verbalization." Specifically:

  • Effect size is small: +0.0165 nats ≈ 1.1% relative (ppl 4.18 → 4.11).
  • Repetition is confounded with the affine. The experiment changes two things at once vs. control: 16× marker repetition and the learned affine. This run cannot separate them. Inspection of the trained affine shows each slot moved only ~4–6% (Frobenius) from identity, so much of the behavior is still "inject ~v_norm into 16 slots" (pure repetition), with the affine a small perturbation on top. A "16 markers, no affine" arm is required to isolate the affine's contribution.
  • The slots did learn distinct transforms: across-slot divergence ‖W_i − W̄‖ is nearly as large as each slot's deviation from identity, i.e. the 16 affines are mostly unique, not a shared shift. This is partial support for the "independent views" idea, but not proof that it is what helps.
  • Single training seed. The CI reflects eval-row noise only, not training variability; the gap could shift if both arms were retrained with a different seed. A +1% gap is within plausible seed-to-seed range for 2.1-epoch continued SFT on an 8B model.
  • NLL is a proxy. This measures teacher-forced response NLL, not the NLA paper's verbalization/reconstruction-quality metrics (FVE, downstream evals).
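The two affine diagnostics mentioned above (each slot's Frobenius deviation from identity, and the across-slot divergence ‖W_i − W̄‖) could be computed along the following lines. The card does not say how the "~4–6%" figure was normalized; this sketch assumes both quantities are divided by ‖I‖_F, and all function names are hypothetical. It uses small nested lists so it runs without any tensor library.

```python
import math

def frobenius(m):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mean_matrix(mats):
    """Element-wise mean over slots, i.e. W-bar."""
    n, d = len(mats), len(mats[0])
    return [[sum(m[r][c] for m in mats) / n for c in range(d)] for r in range(d)]

def slot_diagnostics(weights):
    """For each slot, return (relative deviation from I, relative deviation from W-bar).

    Both are normalised by ||I||_F, which is an assumption of this sketch.
    """
    d = len(weights[0])
    eye = identity(d)
    ref = frobenius(eye)
    mean_w = mean_matrix(weights)
    return [(frobenius(subtract(w, eye)) / ref,
             frobenius(subtract(w, mean_w)) / ref) for w in weights]
```

If the second number is close to the first for most slots, the slots have moved in different directions; if it is near zero, they share a single common shift away from identity.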

For the 16-slot model the per-slot affines are in nla_affine.safetensors (weight [16,4096,4096], bias [16,4096]); apply them at injection time; see launch/eval_av_val_loss.py --multi-input-slots 16 --affine-path ....

wandb: https://wandb.ai/octahedral-systems/nla-multi-affine-experiment

Safetensors · Model size: 8B params · Tensor type: BF16