Aether Mind v6.1: long-context after the NaN fix
V6.1 is the third public Aether release and the first that
trains on a meaningfully long context window. It supersedes
aether-mind-v6.0,
which was published with a forced ctx=64 workaround because of a
forward-pass numerical instability in the NSA compressed branch
(v6/attention.rs::compressed_branch).
That instability is now diagnosed and fixed. The causal mask in the
compressed-branch attention produced rows made up entirely of -inf for
query positions before the first 64-token block completed, which drove
softmax to 0/0 = NaN. The fix tracks per-row validity, unmasks a single
block on otherwise fully masked rows to keep softmax finite, and
multiplies the branch output by a row-validity mask so that those rows
contribute zero attention, which is their intended behaviour. The source
and verification log are in
docs/ops/v6-training-nan-bug.md;
the fix landed in commit
7f9189f8.
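The masking logic described above can be sketched in plain Python. This is an illustration only, not the code in v6/attention.rs: the function names, the rule that a block becomes visible once its last token is at or before the query position, and the choice of block 0 as the single unmasked block are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")


def causal_block_mask(query_positions, n_blocks, block_size=64):
    """mask[i][b] is True when compressed block b lies fully in the past of query i.

    Assumption: block b covers tokens [b*block_size, (b+1)*block_size) and is
    visible only once its last token is at or before the query position.
    """
    return [
        [(b + 1) * block_size <= q + 1 for b in range(n_blocks)]
        for q in query_positions
    ]


def naive_softmax(row):
    """Plain softmax; a row of all -inf yields NaN (the failure mode)."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]  # -inf - (-inf) is NaN
    total = sum(exps)
    return [e / total for e in exps]


def safe_compressed_attention(scores, mask, values):
    """Row-validity-masked attention over compressed blocks (hypothetical helper)."""
    dim = len(values[0])
    out = []
    for row_scores, row_mask in zip(scores, mask):
        valid = any(row_mask)  # per-row validity
        # Fully masked row: unmask one block (block 0 here) so softmax is finite.
        eff_mask = row_mask if valid else [b == 0 for b in range(len(row_mask))]
        masked = [s if m else NEG_INF for s, m in zip(row_scores, eff_mask)]
        probs = naive_softmax(masked)  # finite: at least one entry is unmasked
        vec = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
        # Zero the output of rows that had no valid block to attend to.
        gate = 1.0 if valid else 0.0
        out.append([gate * x for x in vec])
    return out


# Queries at positions 10, 63 and 130 over two 64-token compressed blocks.
mask = causal_block_mask([10, 63, 130], n_blocks=2)
out = safe_compressed_attention([[0.0, 0.0]] * 3, mask, [[1.0, 2.0], [3.0, 4.0]])
print(out)  # [[0.0, 0.0], [1.0, 2.0], [2.0, 3.0]]
```

The query at position 10 sees no completed block, so its row would be all -inf under the naive path; here it comes out as a zero vector instead of NaN.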
V6.1 was trained at 4× the v6.0 context (256 vs 64 tokens) on the same 36,860-row Aether curated corpus, on the same RTX 3080 Ti, in the same wall-clock envelope (~44 min vs v6.0's 50 min; slightly faster because there is no Qwen teacher forward pass).
What you're getting
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct (initialised from, then CE-trained) |
| Architecture | V6 transformer: 24 layers, 896 hidden, 14 attention heads (10 Sephirot + 2 generalist + 2 sink), head_dim=64 |
| Trainable params | ~558 M (all weights, no LoRA) |
| Training mode | Pure cross-entropy (no distillation in this release; see notes below) |
| Training context | 256 tokens (4× the v6.0 release) |
| Precision | BF16 weights, F32 KL/CE math internally for numerical stability |
| NSA config | compression_block=64, top_k=2048, sliding_window=512, sink_tokens=4 |
| Vocab | 151,936 (Qwen2.5 tokenizer, untouched) |
| Max position | 32,768 (RoPE theta = 1e6) |
| Checkpoint published | step 30,000 (full Phase-1 run) |
| File | model.safetensors (1.32 GB, BF16) |
| License | Apache-2.0 (matches base) |
Training run
| Metric | Value | Δ vs v6.0 |
|---|---|---|
| Steps | 30,000 | = |
| Wall-clock | 44.4 min | -10 % |
| Tokens scored | 1,676,479 | +0.3 % (4× context lets more rows fit) |
| Throughput | 629.9 tokens/sec | +12 % |
| Mean CE loss | 10.18 nats/token | better (v6.0 was 10.35 mean CE under the KL blend) |
| Mean Sephirot aux | 0.149 | = |
| Max tokens processed | 167 | (v6.0 truncated to 64) |
| NaN events | 0 | (v6.0 also 0 thanks to the ctx=64 workaround) |
Loss trajectory
step 1 loss=15.75 avg=15.75 (random init)
step 100 loss=15.94 avg=16.32 warm-up
step 1000 loss=11.63 avg=13.20 ← CE/lm-head learning the vocab
step 5000 loss=10.00 avg=11.01
step 10000 loss= 9.13 avg=10.07 ← representational floor (numerically higher than v6.0's 7.68 at this step, but the comparison is apples-to-oranges: v6.0's loss was blended with the KL teacher signal)
step 15000 loss=11.13 avg= 9.87
step 20000 loss=10.25 avg=10.02
step 25000 loss= 9.75 avg=10.15
step 29999 loss= 9.81 avg=10.18
The interesting fact: at step 122 (the row where v6.0 first NaN'd, tokens=167), v6.1 reports a finite loss in the 9-16 range and continues training. This release is empirical evidence that the compressed-branch fix is the right one.
Architecture (unchanged from v6.0)
V6 is not a vanilla Qwen2.5 fine-tune. The attention layer implements a 14-head split designed for on-chain cognitive routing:
- 10 Sephirot heads: one per cognitive domain (Keter → Malkuth). Each head's attention pattern is what the on-chain pallet_qbc_aether_anchor records as the per-cycle attestation root.
- 2 generalist heads: un-gated, full-context attention. Used for the "global workspace" path in aether-mind.
- 2 sink heads: anchor-token attention (first 4 tokens) for stable long-context performance.
The NSA compressed branch (the one that NaN'd) now correctly handles the early-query case via row-validity masking.
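As a quick consistency check on the head split, the layout can be written down as a small Python structure. The `HeadLayout` class and the ordering of heads (Sephirot first, then generalist, then sink) are hypothetical, chosen only for this sketch; the card does not state how head indices are assigned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadLayout:
    """Head counts taken from the table above; index order is an assumption."""
    sephirot: int = 10
    generalist: int = 2
    sink: int = 2
    head_dim: int = 64

    @property
    def n_heads(self) -> int:
        return self.sephirot + self.generalist + self.sink

    @property
    def hidden(self) -> int:
        # Hidden size is the number of heads times the per-head dimension.
        return self.n_heads * self.head_dim

    def role(self, head_index: int) -> str:
        """Map a head index to its group under the assumed ordering."""
        if not 0 <= head_index < self.n_heads:
            raise IndexError(f"head index {head_index} out of range")
        if head_index < self.sephirot:
            return "sephirot"
        if head_index < self.sephirot + self.generalist:
            return "generalist"
        return "sink"


layout = HeadLayout()
print(layout.n_heads, layout.hidden)  # 14 896
```

The product 14 × 64 = 896 matches the hidden size listed in the table.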
How to use
Native runtime (recommended): Rust aether-mind
Set AETHER_V6_CHECKPOINT to the local path of model.safetensors,
then restart qbc-aether-mind.service. The Rust binary loads the weights via candle.
Python
from safetensors.torch import load_file
weights = load_file("model.safetensors") # 315 BF16 tensors
print("params:", sum(t.numel() for t in weights.values()))
There is no upstream 🤗 transformers loader for the V6 14-head
split + Sephirot routing. Production use goes through the Rust
binary in
qubitcoin-aether.
Evaluation
Not yet run. lm-evaluation-harness vs MMLU / ARC / HellaSwag / TruthfulQA is the next session's work. We will back-fill the numbers + comparison vs v5.2-lora + v6.0 here when they land.
Notes vs v6.0
- No KL distillation in this release. The full distillation path (KL teacher signal + CE + Sephirot aux) hits a CUDA OOM at the new ctx=256 because the F32-stable KL log-softmax of the 151K-vocab tensor allocates ~600 MB of intermediates per step that don't free fast enough. Memory optimisation (in-place softmax, KL chunking by vocab-tile) is the v6.2 work. v6.1 is CE-only over the 4× longer context: a different bet that prioritises context reach over teacher matching.
- All 30K steps used the new attention path. The NaN-safe compressed branch runs by default; no env var or config to enable it.
- Same architecture, weights file format, tokenizer, and config shape as v6.0. The Rust binary loads v6.0 and v6.1 from the same loader.
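The memory figure in the first note can be put in perspective with back-of-the-envelope arithmetic. The sketch below assumes a single 256-token row per step and 4-byte F32 elements; the batch shape and the 512-entry tile width are assumptions for illustration, not values read from the trainer.

```python
VOCAB = 151_936  # Qwen2.5 tokenizer vocabulary size (from the table above)
F32_BYTES = 4


def logits_bytes(tokens: int, vocab: int = VOCAB, elem_bytes: int = F32_BYTES) -> int:
    """Size of one dense [tokens, vocab] F32 buffer in bytes."""
    return tokens * vocab * elem_bytes


ctx64 = logits_bytes(64)    # one buffer at the v6.0 context
ctx256 = logits_bytes(256)  # one buffer at the v6.1 context
tile = logits_bytes(256, vocab=512)  # one buffer for a 512-entry vocab tile
n_tiles = -(-VOCAB // 512)  # ceiling division

print(f"ctx=64 : {ctx64 / 1e6:.1f} MB per buffer")
print(f"ctx=256: {ctx256 / 1e6:.1f} MB per buffer")
print(f"tile   : {tile / 1e6:.2f} MB per buffer, {n_tiles} tiles")
```

Under these assumptions one full-vocab F32 buffer at ctx=256 is about 155.6 MB, so ~600 MB is on the order of four such buffers alive at once, while a 512-entry tile needs roughly half a megabyte per buffer.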
Open items for v6.2
- Restore KL+CE distillation at ctx ≥ 256 by chunking the 151K-vocab log-softmax (compute per 512-token vocab chunk so peak memory stays bounded).
- Long-context curriculum (16K → 64K → 128K → 1M) per the V6 master spec, now that the forward-pass NaN is gone.
- lm-evaluation-harness pass for honest numbers.
- HumanEval / coding evals if we add a coding-domain corpus chunk.
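The vocab-chunking idea in the first item can be sketched as a two-pass computation: a streaming log-sum-exp over vocab tiles, then a tile-by-tile KL accumulation. The code below is a pure-Python illustration for a single token position with finite logits. The function names are hypothetical, and the planned v6.2 implementation (Rust, candle tensors) may be structured differently.

```python
import math


def streaming_logsumexp(logits, tile=512):
    """log(sum(exp(logits))) computed one vocab tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), tile):
        chunk = logits[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale the partial sum to the new maximum before adding this tile.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)


def chunked_kl(teacher_logits, student_logits, tile=512):
    """KL(teacher || student) for one position, never holding a full softmax."""
    lse_t = streaming_logsumexp(teacher_logits, tile)
    lse_s = streaming_logsumexp(student_logits, tile)
    kl = 0.0
    for start in range(0, len(teacher_logits), tile):
        t_chunk = teacher_logits[start:start + tile]
        s_chunk = student_logits[start:start + tile]
        for t, s in zip(t_chunk, s_chunk):
            log_p = t - lse_t  # teacher log-probability
            log_q = s - lse_s  # student log-probability
            kl += math.exp(log_p) * (log_p - log_q)
    return kl


teacher = [0.1 * i for i in range(10)]
student = [0.05 * (i % 4) for i in range(10)]
print(chunked_kl(teacher, student, tile=3))
```

Only one tile of intermediates is live at a time, at the cost of reading the logits twice.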
License + citation
Apache-2.0 (matches the base model license).
@misc{aether_mind_v61_2026,
title = {Aether Mind v6.1 --- long-context after the compressed-branch NaN fix},
author = {{BlockArtica} and {QuantumAI-Blockchain}},
year = {2026},
url = {https://huggingface.co/QuantumAI-Blockchain/aether-mind-v6.1},
}
Links
- QuantumAI Blockchain: qbc.network
- GitHub org: github.com/QuantumAI-Blockchain
- Aether (Rust): qubitcoin-aether
- Prior releases:
  - aether-mind-v6.0 (ctx=64, distilled)
  - aether-v5.2-lora (7B LoRA)
- X / Twitter: @qu_bitcoin
- Contact: info@qbc.network
Framework versions
- candle 0.10 + CUDA 12.6
- Rust aether-v6-train binary @ commit 7f9189f8
- Qwen2.5 tokenizer (vocab 151,936)