HAGI Ablation - Model C - Clifford, Bolted-On

TL;DR - Clifford grade decomposition WITHOUT recurrence. One arm of a four-model controlled ablation testing whether Clifford grade decomposition improves reasoning-per-parameter in a small language model. Headline result: it does not - on held-out validation the plain recurrent baseline (B) beats full GDR (D) on every seed. This card documents this arm and links the rest of the family so you can traverse the whole experiment.

One of four same-budget models in the HAGI Grade-Decomposed Recurrence (GDR) ablation. All four share data, tokenizer, schedule, and token budget and differ in only two flags - use_loop and use_gdr - so the comparison isolates the mechanism with no confounds.

Research artifact, not a product. ~114M parameters, trained on 500M tokens (well under Chinchilla-optimal for this size). It writes fluent, on-topic English but its facts are unreliable. Its purpose is a controlled scientific comparison.
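
As a rough check on "well under Chinchilla-optimal", the sketch below compares the token budget with the commonly quoted rule of thumb of about 20 training tokens per parameter. That ratio is an approximation assumed here for illustration; it is not a figure from this card.

```python
# Rough compute-optimal check. The 20 tokens/parameter ratio is a widely
# quoted approximation of the Chinchilla result, assumed here for illustration.
params = 114.6e6          # parameter count from this card
tokens_seen = 500e6       # training tokens from this card
TOKENS_PER_PARAM = 20     # assumption: rule-of-thumb ratio

tokens_per_param = tokens_seen / params
suggested_tokens = TOKENS_PER_PARAM * params

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"rule-of-thumb budget: {suggested_tokens / 1e9:.2f}B tokens")
print(f"fraction of that budget used: {tokens_seen / suggested_tokens:.0%}")
```

Under that assumption the model sees roughly a fifth of the suggested budget, which is why its factual recall should not be trusted.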

This model

| Field | Value |
|---|---|
| Role | Clifford grade decomposition WITHOUT recurrence |
| Recurrence (use_loop) | none |
| Clifford GDR (use_gdr) | grades (scalar/vector/bivector/trivector + geometric product) |
| Parameters | 114.6M |
| Checkpoint | step-00007630.pt (model + optimizer state + config dict) |
| Tokens seen | 500M (sequence length 1024) |
| Eval loss / perplexity | 3.4771 / 32.37 (shared 819,200-token set, seed 42) |

What is HAGI / GDR?

HAGI tests one hypothesis: does decomposing a recurrent transformer's hidden state into Clifford-algebra grades - scalar / vector / bivector / trivector, each with its own update rate, plus a geometric-product cross-grade interaction - improve reasoning-per-parameter in small models?
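
The four grades named above are exactly the grades of a Clifford algebra with three generators. The sketch below shows what a geometric product looks like in that setting, assuming a Euclidean metric (every basis vector squares to +1) and an 8-component multivector. It is a textbook construction for illustration only; the layout and metric HAGI actually uses may differ.

```python
from itertools import product

# Basis blades of a 3-generator algebra as bitmasks: bit i set means e_{i+1}
# is present, so 0b000 is the scalar and 0b111 is the trivector e123.
GRADE_NAMES = ["scalar", "vector", "bivector", "trivector"]

def grade(blade: int) -> int:
    """Grade of a basis blade = number of basis vectors it contains."""
    return bin(blade).count("1")

def reorder_sign(a: int, b: int) -> int:
    """Sign from sorting the basis vectors of blade a followed by blade b."""
    swaps = 0
    a >>= 1
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1

def geometric_product(x: list[float], y: list[float]) -> list[float]:
    """Geometric product of two 8-component multivectors (Euclidean metric)."""
    out = [0.0] * 8
    for a, b in product(range(8), repeat=2):
        # Shared basis vectors square to +1, so the result blade is a XOR b.
        out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out

def basis(blade: int) -> list[float]:
    """Unit multivector with a single basis blade set to 1."""
    v = [0.0] * 8
    v[blade] = 1.0
    return v

e1, e2 = basis(0b001), basis(0b010)
e12 = geometric_product(e1, e2)        # two orthogonal vectors give a bivector
print(GRADE_NAMES[grade(e12.index(1.0))])
print(geometric_product(e12, e12)[0])  # a unit bivector squares to -1
```

The product mixes grades (vector times vector lands in the bivector slot), which is the kind of cross-grade interaction the hypothesis refers to.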

A standard recurrent-depth transformer iterates over a flat hidden vector and gains plateau after a few iterations because every dimension converges at the same rate. GDR splits the state into grades that evolve at different rates, so each reasoning iteration has distinct dynamics.
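
The intuition of grades evolving at different rates can be illustrated with a toy relaxation. The rates and the update rule below are hypothetical and chosen only to make the effect visible; they are not taken from the HAGI code.

```python
# Toy model: each grade relaxes toward a fixed point with its own rate alpha.
# The rates and the update rule below are illustrative assumptions only.
RATES = {"scalar": 0.9, "vector": 0.5, "bivector": 0.25, "trivector": 0.1}

def residual_after(alpha: float, iterations: int) -> float:
    """Distance to the fixed point after repeated h <- h + alpha * (target - h)."""
    h, target = 0.0, 1.0
    for _ in range(iterations):
        h += alpha * (target - h)
    return abs(target - h)

# A flat state is the special case where every component shares one rate.
for name, alpha in RATES.items():
    print(f"{name:9s} residual after 3 iterations: {residual_after(alpha, 3):.4f}")
```

After three iterations the fast grade has almost converged while the slow one has barely moved, so later iterations still change part of the state. With a single shared rate, every component would stall together.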

The ablation (where this model fits)

| Model | Recurrence | Clifford GDR | Params |
|---|---|---|---|
| A | - | - | 113.3M |
| B | loop x3 | - | 113.3M |
| C (this model) | - | grades | 114.6M |
| D | loop x3 | grades | 114.6M |

Decisive comparison: B vs D - same compute pattern and near-identical parameter counts (113.3M vs 114.6M); the only architectural difference is grade decomposition. Secondary: C vs D (bolted-on vs integrated Clifford). A is the floor. Read with the gates in docs/ABLATION.md.
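
The two-flag design can be written out as a small grid. This is a sketch only: the flag names come from this card, but treating them as booleans and the `AblationConfig` class itself are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical container for the two ablation flags named in this card."""
    use_loop: bool   # loop the reasoning core (B and D)
    use_gdr: bool    # enable Clifford grade decomposition (C and D)

MODELS = {
    "A": AblationConfig(use_loop=False, use_gdr=False),
    "B": AblationConfig(use_loop=True, use_gdr=False),
    "C": AblationConfig(use_loop=False, use_gdr=True),
    "D": AblationConfig(use_loop=True, use_gdr=True),
}

def differing_flags(x: str, y: str) -> list[str]:
    """Names of the flags on which two ablation arms disagree."""
    cx, cy = asdict(MODELS[x]), asdict(MODELS[y])
    return [k for k in cx if cx[k] != cy[k]]

print(differing_flags("B", "D"))
print(differing_flags("C", "D"))
```

Each of the two named comparisons differs in exactly one flag, which is what lets a loss difference be attributed to that flag.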

Architecture (shared by all four)

| Component | Value |
|---|---|
| Hidden size | 768 |
| Layers | 12 (4 perception / 4 reasoning / 4 expression) |
| Attention | Grouped-Query Attention (12 query heads, 4 KV heads) |
| MLP | SwiGLU (768 -> 2048 -> 768) |
| Positional | RoPE (theta=10000) |
| Norm | RMSNorm (pre-norm) |
| Embeddings | weight-tied input/output |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Sequence length | 1024 |
| Precision | bf16 |

Models C and D add the GDR grade-update MLPs + geometric product (+~1.3M params -> 114.6M). Models B and D loop the reasoning core 3x per forward.
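
The baseline parameter count can be sanity-checked from the architecture values above. The sketch assumes details the card does not state: no bias terms, a head dimension of hidden size divided by query heads, and two RMSNorm weight vectors per layer plus one final norm.

```python
# Assumptions (not stated in the card): no bias terms, head_dim = hidden / query
# heads, two RMSNorm weight vectors per layer plus one final norm.
hidden, layers, vocab = 768, 12, 49_152
q_heads, kv_heads, mlp_inner = 12, 4, 2048
head_dim = hidden // q_heads                       # 64

embedding = vocab * hidden                         # shared with the output head
attn = (hidden * q_heads * head_dim                # Q projection
        + 2 * hidden * kv_heads * head_dim         # K and V projections
        + q_heads * head_dim * hidden)             # output projection
mlp = 3 * hidden * mlp_inner                       # SwiGLU: gate, up, down
norms = 2 * hidden * layers + hidden               # per-layer norms + final norm

total = embedding + layers * (attn + mlp) + norms
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands on the 113.3M reported for A and B. Looping reuses the same weights, so B adds none, and the roughly 1.3M extra in C and D is the GDR machinery.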

Training (shared)

| Setting | Value |
|---|---|
| Data | FineWeb-Edu sample-10BT |
| Tokens | 500M (~7,629 steps) |
| Optimizer | AdamW (lr 3e-4, wd 0.1, cosine decay, 400-step warmup) |
| Effective batch | 65,536 tokens/step (batch 16 x grad-accum 4 x seq 1024) |
| Hardware | Google Colab A100-40GB (bf16 + FlashAttention + torch.compile) |
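
The step count follows from the batch settings above. The sketch below redoes that arithmetic; rounding up to cover the full budget is an assumption about how the final step was counted.

```python
import math

batch, grad_accum, seq_len = 16, 4, 1024
token_budget = 500_000_000

tokens_per_step = batch * grad_accum * seq_len
exact_steps = token_budget / tokens_per_step
steps_to_cover = math.ceil(exact_steps)   # assumption: run until the budget is met

print(tokens_per_step, round(exact_steps, 1), steps_to_cover)
```

Rounding up gives 7630, which is consistent with the checkpoint name step-00007630.pt.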

Load and run (free, CPU)

git clone -b experimental https://github.com/ShmidtS/HAGI.git && cd HAGI
pip install -r requirements.txt
python scripts/generate.py --hf-repo NAME0x0/hagi-ablation-c \
    --prompt "The sun is a star that" --device cpu

Inference fits in <1GB - no GPU needed. Or load the checkpoint directly:

import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from prototype.training.loop import load_checkpoint
from prototype.data.tokenizer import load_tokenizer

# Download the checkpoint from the Hub and rebuild the model on CPU.
ckpt = hf_hub_download("NAME0x0/hagi-ablation-c", "step-00007630.pt")
model, step = load_checkpoint(ckpt, device="cpu")
model.eval()
tok = load_tokenizer("HuggingFaceTB/SmolLM2-135M")

# Sample 60 new tokens at temperature 0.8, one token per forward pass.
x = torch.tensor([tok.encode("The sun is a star that")])
for _ in range(60):
    with torch.no_grad():
        logits = model(x)[0, -1]  # logits at the last position
    probs = F.softmax(logits / 0.8, dim=-1)
    next_token = torch.multinomial(probs, 1).view(1, 1)
    x = torch.cat([x, next_token], dim=1)
print(tok.decode(x[0].tolist()))
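
The loop above divides the logits by 0.8 before the softmax. The standalone sketch below shows what that temperature does, using made-up logits and only the standard library so it runs without the checkpoint.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]                 # made-up values for illustration
for t in (1.0, 0.8, 0.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 moves probability mass toward the top token, which makes samples more conservative but still random.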

Model family

All models below share data, tokenizer, schedule, and token budget. Stage 0 is the separate pretraining-baseline track; A/B/C/D are this controlled ablation.

| Model | Role |
|---|---|
| Stage 0 | pretraining baseline (separate track, not part of this ablation) |
| A | dense baseline - the experimental control / floor |
| B | recurrence alone (looped reasoning core, flat hidden state) |
| C (you are here) | Clifford grade decomposition WITHOUT recurrence |
| D | loop + grade decomposition (the full mechanism under test) |

Results

All four scored on the same 819,200-token fixed batch set (identical batches; lower loss is better). One run per model, training seed 42.

| Model | Params | Loss | Perplexity |
|---|---|---|---|
| A | 113.3M | 3.4880 | 32.72 |
| B | 113.3M | 3.4852 | 32.63 |
| C (this) | 114.6M | 3.4771 | 32.37 |
| D | 114.6M | 3.4702 | 32.14 |

  • B vs D: loss D-B = -0.0150 (negative = grade decomposition helps).
  • C vs D: loss D-C = -0.0069 (negative = integrated GDR beats bolted-on Clifford).
  • Ranking on this set, from highest loss to lowest: A, B, C, D. This matches the hypothesis: the grade decomposition carries the signal; recurrence helps mainly in its presence (loop-alone B barely moves A).
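
The perplexities and deltas in this section follow directly from the losses, since perplexity is the exponential of the mean loss. A short check using only the values reported here:

```python
import math

loss = {"A": 3.4880, "B": 3.4852, "C": 3.4771, "D": 3.4702}

# Perplexity is exp(loss) when the loss is mean cross-entropy in nats.
perplexity = {m: round(math.exp(v), 2) for m, v in loss.items()}
delta_d_b = round(loss["D"] - loss["B"], 4)
delta_d_c = round(loss["D"] - loss["C"], 4)

print(perplexity)
print(delta_d_b, delta_d_c)
```

The gaps are small: the whole spread from A to D is under 0.02 nats, so these rankings are sensitive to seed and evaluation set.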

These are train-set numbers (one run per model). They did NOT hold up: on held-out validation across 5 seeds (note below), the ranking reverses - B beats D every time (mean D-B = +0.0175). The small train-set D-advantage does not generalize, consistent with mild overfitting by D's extra parameters. Conclusion: grade decomposition does not improve held-out loss at this scale. A negative result - see docs/ABLATION.md for the gates.

This model's sample (prompt "The sun is a star that", temperature 0.8, seed 42):

The sun is a star that is the main star in the picture. It is an extremely hot star that can be seen through the sun. This is also the brightest star in the picture. The sun is another star that can

Seed stability: across 5 seeds on held-out validation (shard_00006.bin), D beat B in 0/5 runs (mean loss D-B = +0.0175).
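
To put the held-out gap in perplexity terms, a loss difference can be exponentiated into a perplexity ratio. The sketch uses only the mean delta reported above; per-seed values are not given in this card, so none are shown.

```python
import math

mean_delta_d_minus_b = 0.0175            # held-out mean over 5 seeds, from this card
ppl_ratio = math.exp(mean_delta_d_minus_b)
print(f"D perplexity / B perplexity = {ppl_ratio:.4f}")
```

Because the delta is a mean of loss differences, the ratio is a geometric mean over seeds: D's held-out perplexity is about 1.8% higher than B's.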

Geometry diagnostic (follow-up)

To locate why full GDR (D) lost to recurrence-only (B), a third arm - D_nogeo, GDR with the geometric-product cross-grade term switched off (same 114.6M parameters as D) - was trained head-to-head with B and D on the same held-out shard (Kaggle T4, fp16, 1 seed: [1]).

| Comparison | Mean held-out loss delta (lower = better) |
|---|---|
| D - B (full GDR vs recurrence-only) | +0.0210 |
| D_nogeo - B (grades, no geometric product) | +0.0176 |

Removing the geometric product does not recover the gap - the grade machinery itself hurts at this scale, independent of the geometric product. GDR-as-built is falsified here; the path forward is a paper-faithful rebuild or a pivot to capability.
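
The two deltas can be split into the part that disappears when the geometric product is switched off and the part that remains. This is plain arithmetic on the reported numbers; with a single seed it is indicative rather than a measured effect size.

```python
delta_full = 0.0210      # D - B, from the table above
delta_nogeo = 0.0176     # D_nogeo - B, from the table above

geo_contribution = delta_full - delta_nogeo
remaining_fraction = delta_nogeo / delta_full

print(f"attributable to the geometric product: {geo_contribution:+.4f}")
print(f"gap remaining without it: {remaining_fraction:.0%}")
```

Most of the gap to B remains after the geometric product is removed, which is the basis for locating the problem in the grade machinery.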
