OdinNext-138M-Instruct

A 138.4M-parameter instruction-tuned language model that replaces softmax self-attention with an HGRN2 gated linear recurrence. It is fine-tuned from OdinNext-138M-Base, which was pretrained from scratch on 101.6B tokens on two AMD Ryzen AI MAX+ 395 (Strix Halo) mini-PCs. Pretraining used a TST + DiffusionBlocks + dual-machine DDP stack that trained roughly 10-20x faster than a conventional end-to-end pass on the same hardware.

This is a small model. It follows instructions and writes fluent, assistant-style answers (markdown, step-by-step), but its factual accuracy is limited by scale. Treat it as a lightweight assistant and a research artifact, not a knowledge base.

Uses custom Transformers code. trust_remote_code=True runs Python from this repo — review the files or pin a commit before trusting it.

Results

Zero-shot, on three widely-reported public benchmarks. OdinNext rows were measured with our own harness (scripts/eval_benchmarks.py; HellaSwag = acc_norm, ARC = mean of Easy+Challenge acc, PIQA = acc); the other rows are as reported by Axiomic Labs on the GPT-X2-125M card, so numbers are not perfectly comparable across harnesses.

| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 1T |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Base | 33.05% | 34.29% | 58.81% | 101.6B |
| This work | OdinNext-138M-Instruct | 32.85% | 33.14% | 59.25% | 101.6B + SFT/SeqKD |
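
The metric definitions given in the note before the table can be illustrated with a small sketch. This is not the code in scripts/eval_benchmarks.py: the function names are hypothetical, the scores are made up, and dividing by the character length of each continuation is a common convention (used, for example, by EleutherAI's lm-evaluation-harness) that this harness is assumed, not confirmed, to share. The ARC average is likewise assumed to be an unweighted mean of the two accuracies.

```python
def pick_acc(loglikelihoods):
    """Index of the choice with the highest raw log-likelihood (plain acc)."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_acc_norm(loglikelihoods, choices):
    """Index of the best choice after dividing by continuation length (acc_norm)."""
    scores = [ll / len(text) for ll, text in zip(loglikelihoods, choices)]
    return max(range(len(scores)), key=lambda i: scores[i])


def arc_avg(easy_acc, challenge_acc):
    """Mean of ARC-Easy and ARC-Challenge accuracy (assumed unweighted)."""
    return (easy_acc + challenge_acc) / 2.0


# Toy example with made-up scores: the longer answer has a lower total
# log-likelihood but a better per-character score.
choices = ["no", "yes, it does"]
lls = [-3.0, -6.0]
print(pick_acc(lls))                 # raw scoring picks index 0
print(pick_acc_norm(lls, choices))   # length-normalized scoring picks index 1
```

Because the two scoring rules can disagree on the same item, acc and acc_norm numbers from different harnesses are only loosely comparable.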

Usage

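from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "joelhenwang/OdinNext-138M-Instruct"
# trust_remote_code=True executes the custom modeling code shipped in the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# "cuda" also selects AMD GPUs on ROCm builds of PyTorch.
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16,
).to("cuda").eval()

# Build a ChatML prompt that ends with an open assistant turn.
msgs = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
ids = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to("cuda")
# Sample up to 200 new tokens; the repetition penalty curbs loops at this scale.
out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.3)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))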

Uses ChatML (<|im_start|>role\n...<|im_end|>). A repetition_penalty around 1.2-1.3 is recommended at this scale.
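
For readers who want to see what the ChatML prompt looks like as plain text, here is a minimal sketch. The helper `format_chatml` is hypothetical and not part of this repo, and the newline placed after each `<|im_end|>` is an assumption. The tokenizer's bundled chat template remains the authoritative format.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml(
    [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
)
print(prompt)
```

In practice `tok.apply_chat_template` should be preferred, since it applies the template the model was actually trained with.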

Architecture

Decoder-only causal LM, 16 pre-norm blocks:

x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU2(ZCRMSNorm(x))

| Item | Value |
|---|---|
| Parameters | 138.4M (113.3M non-embedding) |
| Layers / hidden / heads | 16 / 768 / 6 |
| Per-head recurrent state | 128 x 128 |
| FFN inner | 2,048 |
| Vocabulary | 32,770 (custom 32K BPE + 2 ChatML tokens) |
| Max sequence length | 2,048 |
| Mixer | HGRN2 gated linear recurrence; RoPE (theta=100K) on even layers, position-free on odd |
| Decoding state | fixed-size recurrent state (O(1)/token), not a growing KV cache |

The HGRN2 state S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t is constant in size w.r.t. context length (~3 MiB fp16 at batch 1), unlike a Transformer KV cache, which grows linearly with the number of tokens.
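
The ~3 MiB figure follows from the architecture table (16 layers, 6 heads, a 128 x 128 state per head, 2 bytes per fp16 value). The sketch below reproduces that arithmetic and compares it with the KV cache of a hypothetical softmax-attention model of the same width. That comparison model is an assumption for illustration (full multi-head attention, no grouped-query sharing), and `hgrn2_step` is a plain-Python toy of the update rule for a single head, not the repo's kernel.

```python
import math

LAYERS, HEADS, HEAD_DIM, HIDDEN = 16, 6, 128, 768
FP16_BYTES = 2


def recurrent_state_bytes(batch=1):
    """Total HGRN2 state: one HEAD_DIM x HEAD_DIM matrix per head per layer."""
    return batch * LAYERS * HEADS * HEAD_DIM * HEAD_DIM * FP16_BYTES


def kv_cache_bytes(num_tokens, batch=1):
    """KV cache of a hypothetical softmax-attention model with the same shape."""
    # Keys and values (factor 2), one HIDDEN-wide vector each per layer per token.
    return batch * num_tokens * 2 * LAYERS * HIDDEN * FP16_BYTES


def hgrn2_step(state, g, k, v):
    """One update S_t = diag(exp(g_t)) S_{t-1} + outer(k_t, v_t) for one head."""
    return [
        [math.exp(g[i]) * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


print(recurrent_state_bytes() / 2**20)   # 3.0 MiB, independent of context length
print(kv_cache_bytes(2048) / 2**20)      # 96.0 MiB at the 2,048-token limit

# Tiny 2 x 2 state: the shape stays the same however many tokens are processed.
state = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(1000):
    state = hgrn2_step(state, g=[-0.1, -0.2], k=[0.5, 0.5], v=[1.0, -1.0])
print(len(state), len(state[0]))         # 2 2
```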

Training

Data

Pretraining used the Dolmino mix (allenai/dolma3_dolmino_mix-100B-1025), curated by dropping the synthetic and noisy partitions and keeping the natural text + code:

  • Excluded: all synthetic reasoning-trace subsets (Gemini / QwQ / R1 / OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite, verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
  • Kept: natural web text, code (stack-edu, cranecode; FIM markers stripped), math, and reference text — the mix's native proportions minus the exclusions.
  • Tokenizer: a custom 32K BPE. After tokenization this gives 101.6B training tokens.

Post-training data: smol-smoltalk + no_robots (SFT), and synthetic ChatML distilled from LFM2.5-1.2B-Instruct (SeqKD teacher).

How we accelerated pretraining (the interesting part)

Pretraining ran on two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5) mini-PCs (128 GB unified LPDDR5X each), linked over Thunderbolt 4, with DDP on the gloo backend. Three techniques compounded:

  1. TST - Token Superposition Training (bag-size 4). Early in training, every position is the average of 4 stochastic sub-word tokenizations of the same text, so the model digests ~4x the tokens per step. The bag size is annealed 4 -> 2 -> 1 over training so the model finishes on ordinary single-token streams.
  2. DiffusionBlocks (B=4). The 16 layers are split into 4 blocks of 4 layers, each trained to denoise its input representation. Crucially, the blocks are trained block-parallel across the two machines with essentially no gradient all-reduce - Machine A owns blocks 1-2, Machine B owns blocks 3-4.
  3. Two-machine DDP over Thunderbolt 4. Unified memory means gloo keeps pace, and DiffusionBlocks' block independence hides the modest interconnect bandwidth.
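
The layout described in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: all names are hypothetical, the blocks are assumed to be contiguous runs of layers, the annealing boundaries (50% and 80% of training) are placeholders because the actual schedule is not given here, and `superpose` only shows the averaging step on plain lists.

```python
NUM_LAYERS = 16
NUM_BLOCKS = 4

# Block-to-machine assignment as described above.
MACHINE_BLOCKS = {"A": [1, 2], "B": [3, 4]}


def block_layers(block_idx):
    """0-based layer indices owned by DiffusionBlocks block 1..4 (assumed contiguous)."""
    per_block = NUM_LAYERS // NUM_BLOCKS
    start = (block_idx - 1) * per_block
    return list(range(start, start + per_block))


def bag_size(progress, boundaries=(0.5, 0.8)):
    """TST bag size for a training progress in [0, 1]; anneals 4 -> 2 -> 1.

    The boundary values are placeholders, not the schedule actually used.
    """
    if progress < boundaries[0]:
        return 4
    if progress < boundaries[1]:
        return 2
    return 1


def superpose(embeddings):
    """Average a bag of equal-length embedding vectors into one input vector."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


for machine, blocks in MACHINE_BLOCKS.items():
    layers = [layer for b in blocks for layer in block_layers(b)]
    print(machine, layers)           # A gets layers 0-7, B gets layers 8-15
print([bag_size(p) for p in (0.0, 0.6, 0.9)])   # [4, 2, 1]
print(superpose([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]))  # [0.5, 0.5]
```

Since each machine only updates the layers of its own blocks, no gradients for those layers need to cross the interconnect, which is the property the text credits for hiding the modest link bandwidth.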

Combined, the TST + DiffusionBlocks + dual-machine phase trained roughly 10-20x faster than a conventional end-to-end autoregressive pass on the same two machines (and dramatically faster than a single accelerator) - which is what made a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter standard end-to-end phase then restores ordinary left-to-right generation; the released base weights come from that phase (EMA, decay 0.999).
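
The EMA mentioned above is the usual exponential moving average of the weights. A minimal sketch with decay 0.999 follows, using scalar "parameters" in a dict for readability. In a real run the values would be tensors, and the function name is hypothetical.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend the current weights into the running average, in place."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Starting from 0 and averaging a constant 1.0, the EMA closes the gap slowly:
# after n steps it equals 1 - 0.999**n.
ema = {"w": 0.0}
for _ in range(1000):
    ema_update(ema, {"w": 1.0})
print(round(ema["w"], 3))   # about 0.632 after 1,000 steps
```

With decay 0.999 the average has an effective window on the order of a thousand steps, so it smooths step-to-step noise without lagging far behind the live weights.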

Optimization

  • Optimizer: NorMuon (2D weight matrices, fp16 Newton-Schulz) + AdamW (1D params / embeddings)
  • Precision: fp16 + GradScaler (bf16 is slower / unstable on gfx1151)
  • Stabilization: z-loss 1e-4, attention soft-cap 50, EMA 0.999
  • Compile: torch.compile (max-autotune-no-cudagraphs)
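
Two of the stabilizers listed above have simple closed forms. The sketch below assumes the common formulations: a tanh-based soft-cap, cap * tanh(x / cap), and a z-loss equal to the coefficient times the squared log of the softmax normalizer. Which exact quantity the soft-cap is applied to in this model is not stated here, so treat the function names and their use as illustrative.

```python
import math


def soft_cap(x, cap=50.0):
    """Squash a score smoothly into (-cap, cap) with a scaled tanh."""
    return cap * math.tanh(x / cap)


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * log(Z)^2 that keeps the softmax normalizer near 1."""
    return coeff * logsumexp(logits) ** 2


print(soft_cap(10.0))            # close to 10: small scores pass almost unchanged
print(soft_cap(500.0))           # close to 50: large scores saturate at the cap
print(z_loss([2.0, 1.0, 0.0]))   # small positive penalty added to the cross-entropy
```

Both terms bound the magnitude of values that feed exponentials, which matters under fp16, where large logits overflow more readily than in bf16 or fp32.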

Post-training

  1. SFT (full-parameter, cross-entropy) on smol-smoltalk + no_robots.
  2. SeqKD: a second SFT pass on ~10k ChatML responses generated by LFM2.5-1.2B-Instruct, which teaches the small student a cleaner, more direct answer style.

LiNeS layer-scaling and DPO were evaluated and dropped: at 138M, aggressive LiNeS removed instruction-following and DPO over-optimized into incoherence. Plain SFT + SeqKD gave the best behavior.

Limitations

  • Small model: limited reasoning and factual recall; it will state wrong facts confidently. Not for factual QA or safety-sensitive use.
  • 2,048-token context in the released inference code.
  • English-focused.
  • No RLHF / safety tuning.
  • Benchmarks above are preliminary and harness-dependent; run your own eval.

Citation

@misc{odinnext_138m_instruct_2026,
  title        = {OdinNext-138M-Instruct},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
  note         = {138M HGRN2 recurrent instruction model; TST + DiffusionBlocks +
                  dual-machine DDP pretraining on AMD Strix Halo, then SFT + SeqKD}
}

References

  • Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
  • Bowen Peng et al. Token Superposition Training. arXiv:2605.06546.
  • Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
  • Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.
  • Comparison numbers and card structure inspired by Axiomic Labs' GPT-X2-125M.

Trained on AMD Strix Halo (gfx1151, RDNA 3.5), ROCm 7.13.
