Aether Mind v6.1 — long-context after the NaN fix

V6.1 is the third public Aether release and the first that trains on a meaningfully long context window. It supersedes aether-mind-v6.0, which was published with a forced ctx=64 workaround for a forward-pass numerical instability in the NSA compressed branch (v6/attention.rs::compressed_branch).

That instability is now diagnosed and fixed. The compressed branch's causal mask produced rows that were entirely -inf for query positions before the first 64-token block completed, which drove softmax to 0/0 = NaN. The fix tracks per-row validity, unmasks a single block on otherwise fully masked rows to keep softmax finite, and multiplies the branch output by a row-validity mask so that those rows contribute zero attention, which is their correct behaviour. The source and verification log are in docs/ops/v6-training-nan-bug.md; the fix landed in commit 7f9189f8.
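
The fix described above can be sketched for a single query row. This is a minimal pure-Python illustration, not the code in v6/attention.rs: the function names, the scalar block values, and the rule that a query at position `pos` sees `(pos + 1) // block_size` completed blocks are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Max-shifted softmax. On an all -inf row, (-inf) - (-inf) is NaN,
    # so every probability comes out NaN.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def compressed_branch_row(scores, values, pos, block_size=64):
    """Toy compressed-branch attention for one query position.

    scores[j] is the raw score against block j and values[j] is a scalar
    block summary. Assumption: block j is visible once it has completed.
    """
    visible = (pos + 1) // block_size
    masked = [s if j < visible else NEG_INF for j, s in enumerate(scores)]
    row_valid = visible > 0
    if not row_valid:
        masked[0] = 0.0  # unmask a single block so softmax stays finite
    probs = softmax(masked)
    out = sum(p * v for p, v in zip(probs, values))
    # Row-validity mask: rows with nothing legitimate to attend to output zero.
    return out * (1.0 if row_valid else 0.0)

scores = [0.0, 0.0, 0.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0]
print(softmax([NEG_INF] * 4))                      # the original failure: all NaN
print(compressed_branch_row(scores, values, 10))   # no completed block: zero output
print(compressed_branch_row(scores, values, 200))  # three visible blocks, averaged
```

The temporary unmasking only exists to keep the softmax finite; the final multiplication discards whatever that row attended to.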

V6.1 was trained at 4× the v6.0 context (256 vs 64 tokens) on the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti, in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly faster because there is no Qwen teacher forward pass).

What you're getting

Field	Value
Base model	`Qwen/Qwen2.5-0.5B-Instruct` (initialised from, then CE-trained)
Architecture	V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64
Trainable params	~558 M (all weights, no LoRA)
Training mode	Pure cross-entropy (no distillation in this release — see notes below)
Training context	256 tokens (4× the v6.0 release)
Precision	BF16 weights, F32 KL/CE math internally for numerical stability
NSA config	compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4
Vocab	151,936 (Qwen2.5 tokenizer, untouched)
Max position	32,768 (RoPE theta = 1e6)
Checkpoint published	step 30,000 (full Phase-1 run)
File	`model.safetensors` (1.32 GB, BF16)
License	Apache-2.0 (matches base)

Training run

Metric	Value	Δ vs v6.0
Steps	30,000	=
Wall-clock	44.4 min	−10 %
Tokens scored	1,676,479	+0.3 % (the 4× context truncates fewer rows)
Throughput	629.9 tokens/sec	+12 %
Mean CE loss	10.18 nats/token	better (v6.0 was 10.35 mean CE under the KL blend)
Mean Sephirot aux	0.149	=
Longest row processed	167 tokens	(v6.0 truncated to 64)
NaN events	0	(v6.0 also 0 thanks to the ctx=64 workaround)
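
As a consistency check, the figures in the table can be related to each other with simple arithmetic. The sketch below uses only numbers from the table; treating each step as a single corpus row is an assumption.

```python
tokens_scored = 1_676_479
throughput_tok_per_s = 629.9
steps = 30_000

# Wall-clock implied by tokens scored and throughput.
wall_clock_min = tokens_scored / throughput_tok_per_s / 60
# Average number of scored tokens per optimisation step.
mean_tokens_per_step = tokens_scored / steps

print(f"implied wall-clock: {wall_clock_min:.1f} min")      # 44.4 min
print(f"mean tokens per step: {mean_tokens_per_step:.1f}")  # 55.9
```

A mean of roughly 56 tokens per step suggests that most rows were already shorter than 64 tokens, which would be consistent with the small +0.3 % change in tokens scored.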

Loss trajectory

step      1  loss=15.75  avg=15.75   (random init)
step    100  loss=15.94  avg=16.32   warm-up
step   1000  loss=11.63  avg=13.20   ← CE/lm-head learning the vocab
step   5000  loss=10.00  avg=11.01
step  10000  loss= 9.13  avg=10.07   ← representational floor (higher than v6.0's 7.68 at this step, but that is apples-to-oranges: v6.0's loss was blended with the KL teacher signal)
step  15000  loss=11.13  avg= 9.87
step  20000  loss=10.25  avg=10.02
step  25000  loss= 9.75  avg=10.15
step  29999  loss= 9.81  avg=10.18

The key observation: at step 122, the 167-token row where v6.0 first produced NaN, v6.1 reports a finite loss in the 9-16 range and training continues. A full 30,000-step run with zero NaN events is the empirical evidence that the compressed-branch fix works.
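
A check like the one described above can be automated against the training log. The following is a minimal sketch that assumes log lines follow the `step N  loss=X  avg=Y` shape shown in the trajectory; the function names and the regular expression are hypothetical, not part of the Aether tooling.

```python
import math
import re

# Tolerates the padding after "loss=" seen in lines such as "loss= 9.13".
LINE = re.compile(r"step\s+(\d+)\s+loss=\s*(\S+)\s+avg=\s*(\S+)")

def parse_trajectory(text):
    """Return (step, loss, avg) tuples for every line that matches."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows

def first_non_finite(rows):
    """Return the first step whose loss or average is NaN/inf, else None."""
    for step, loss, avg in rows:
        if not (math.isfinite(loss) and math.isfinite(avg)):
            return step
    return None

sample = """\
step      1  loss=15.75  avg=15.75   (random init)
step  10000  loss= 9.13  avg=10.07
step  29999  loss= 9.81  avg=10.18
"""
rows = parse_trajectory(sample)
print(rows)
print(first_non_finite(rows))  # None when every logged value is finite
```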

Architecture (unchanged from v6.0)

V6 is not a vanilla Qwen2.5 fine-tune. The attention layer implements a 14-head split designed for on-chain cognitive routing:

10 Sephirot heads — one per cognitive domain (Keter → Malkuth). Each head's attention pattern is what the on-chain pallet_qbc_aether_anchor records as the per-cycle attestation root.
2 generalist heads — un-gated, full-context attention. Used for the "global workspace" path in aether-mind.
2 sink heads — anchor-token attention (first 4 tokens) for stable long-context performance.

The NSA compressed branch (the one that NaN'd) now correctly handles the early-query case via row-validity masking.
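
The head split above can be written down as a small layout object. This is an illustrative sketch only: the class name and the ordering of head indices (Sephirot first, then generalist, then sink) are assumptions, since the source gives the counts but not the index layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HeadLayout:
    sephirot: int = 10    # one head per cognitive domain
    generalist: int = 2   # un-gated, full-context heads
    sink: int = 2         # anchor-token heads
    head_dim: int = 64

    @property
    def n_heads(self) -> int:
        return self.sephirot + self.generalist + self.sink

    @property
    def hidden(self) -> int:
        return self.n_heads * self.head_dim

    def role(self, head: int) -> str:
        # Assumed ordering: [sephirot..., generalist..., sink...]
        if not 0 <= head < self.n_heads:
            raise IndexError(f"head {head} out of range")
        if head < self.sephirot:
            return "sephirot"
        if head < self.sephirot + self.generalist:
            return "generalist"
        return "sink"

layout = HeadLayout()
print(layout.n_heads, layout.hidden)  # 14 896
print([layout.role(h) for h in range(layout.n_heads)])
```

The arithmetic matches the spec table: 14 heads at head_dim=64 give the 896-wide hidden state.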

How to use

Native runtime (recommended) — Rust `aether-mind`

Set AETHER_V6_CHECKPOINT to the local path of model.safetensors, then restart qbc-aether-mind.service. The Rust binary loads the checkpoint via candle.

Python

from safetensors.torch import load_file
weights = load_file("model.safetensors")  # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))

There is no upstream 🤗 transformers loader for the V6 14-head split + Sephirot routing. Production use goes through the Rust binary in qubitcoin-aether.
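
To inspect the checkpoint without loading 1.32 GB of weights, the safetensors header can be read with the standard library alone. The sketch relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical.

```python
import json
import struct

def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} from the header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header

def summarize(header):
    """Return (tensor count, total parameter count, set of dtypes)."""
    n_params = 0
    dtypes = set()
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        n_params += n
        dtypes.add(info["dtype"])
    return len(header), n_params, dtypes

# Usage, assuming model.safetensors is in the working directory:
#   count, params, dtypes = summarize(read_safetensors_header("model.safetensors"))
#   print(count, params, dtypes)
```

This reads only the header bytes, so it is a cheap way to confirm tensor names, shapes, and dtypes before pointing the Rust binary at the file.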

Evaluation

Not yet run. lm-evaluation-harness vs MMLU / ARC / HellaSwag / TruthfulQA is the next session's work. We will back-fill the numbers + comparison vs v5.2-lora + v6.0 here when they land.

Notes vs v6.0

No KL distillation in this release. The full distillation path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at the new ctx=256 because the F32-stable KL log-softmax of the 151K-vocab tensor allocates ~600 MB of intermediates per step that don't free fast enough. Memory optimisation (in-place softmax, KL chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over the 4× longer context — a different bet that prioritises context reach over teacher matching.
All 30K steps used the new attention path. The NaN-safe compressed branch runs by default; no env var or config to enable it.
Same architecture, weights file format, tokenizer, and config shape as v6.0. The Rust binary loads v6.0 and v6.1 from the same loader.
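
The OOM note above can be made concrete with a back-of-envelope estimate. The sketch assumes a batch of one 256-token row and F32 (4-byte) elements; the count of four live vocab-sized buffers is a guess chosen to show how a figure near 600 MB can arise, not something the source states.

```python
ctx = 256
vocab = 151_936
bytes_per_f32 = 4
live_buffers = 4  # assumption: e.g. teacher/student logits plus their log-softmaxes

per_buffer = ctx * vocab * bytes_per_f32
total = per_buffer * live_buffers

print(f"one [ctx, vocab] F32 buffer: {per_buffer / 1e6:.1f} MB")  # 155.6 MB
print(f"{live_buffers} such buffers: {total / 1e6:.1f} MB")       # 622.3 MB
```

At ctx=64 the same estimate is a quarter of the size, which would fit with v6.0 training without this problem.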

Open items for v6.2

Restore KL+CE distillation at ctx ≥ 256 by chunking the 151K-vocab log-softmax (compute it per 512-entry vocab chunk so peak memory stays bounded).
Long-context curriculum (16K → 64K → 128K → 1M) per the V6 master spec, now that the forward-pass NaN is gone.
lm-evaluation-harness pass for honest numbers.
HumanEval / coding evals if we add a coding-domain corpus chunk.
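
The vocab-chunked KL idea from the first open item can be sketched for one token position. This is a pure-Python illustration under stated assumptions, not the planned v6.2 implementation: it uses a streaming log-sum-exp followed by a second pass over the same tiles, takes KL(teacher || student) as the direction, and assumes all logits are finite. The function names are hypothetical.

```python
import math

def streaming_logsumexp(logits, chunk):
    """log(sum(exp(logits))) computed one vocab tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), chunk):
        tile = logits[start:start + chunk]
        new_max = max(running_max, max(tile))
        # Rescale the sum accumulated so far to the new reference maximum.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in tile)
        running_max = new_max
    return running_max + math.log(running_sum)

def chunked_kl(teacher_logits, student_logits, chunk=512):
    """KL(teacher || student) over the vocab, never holding more than one tile
    of intermediate log-probabilities at a time."""
    lse_t = streaming_logsumexp(teacher_logits, chunk)
    lse_s = streaming_logsumexp(student_logits, chunk)
    kl = 0.0
    for start in range(0, len(teacher_logits), chunk):
        t_tile = teacher_logits[start:start + chunk]
        s_tile = student_logits[start:start + chunk]
        for t, s in zip(t_tile, s_tile):
            log_pt = t - lse_t
            log_ps = s - lse_s
            kl += math.exp(log_pt) * (log_pt - log_ps)
    return kl

teacher = [0.1 * i for i in range(10)]
student = [0.05 * i * i - 1.0 for i in range(10)]
print(chunked_kl(teacher, student, chunk=3))
print(chunked_kl(teacher, teacher, chunk=3))  # near zero for identical logits
```

The trade-off is two passes over the logits instead of one; in exchange, the full-vocab log-softmax is never materialised.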

License + citation

Apache-2.0 (matches the base model license).

@misc{aether_mind_v61_2026,
  title  = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
  author = {{BlockArtica} and {QuantumAI-Blockchain}},
  year   = {2026},
  url    = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}

