OdinNext-138M-Base
OdinNext is a 138M-parameter causal language model that replaces softmax self-attention with an HGRN2-style gated linear recurrence. This repository is the base pretrained model — trained from scratch on ~101.6B tokens of curated data (the Dolmino mix) on two AMD Strix Halo (gfx1151) machines.
This is a base model: it completes and continues text. It is not an
instruction-tuned or chat model — no SFT, DPO, RLHF, or chat template. An
instruction-tuned variant is available at
joelhenwang/OdinNext-138M-Instruct.
- Repo: joelhenwang/OdinNext-138M-Base
  - main: EMA-shadowed weights (decay 0.999), recommended.
  - live: raw training weights at the same step.
- Context window: 2,048 tokens in the released inference code.
- License: Apache-2.0.
Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.
At a glance
| Item | Value |
|---|---|
| Unique tied parameters | 138,449,696 |
| Non-embedding parameters | 113,283,872 |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
| Cache type | Fixed-size recurrent state, not a growing KV cache |
Architecture
Decoder-only causal LM, 16 identical pre-norm blocks:
x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn) * SwiGLU²(ZCRMSNorm(x))
The HGRN2 recurrent state updates per token as:
S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t
with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] =
[B, 6, 128, 128]. This state is constant in size with respect to context
length, giving O(1)-per-token decoding rather than a growing KV cache.
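The recurrence above can be written out as a naive per-token loop. The following is a minimal NumPy sketch for a single head, not the repository's kernel: the function name `hgrn2_scan`, the toy shapes, and the assumption that `g_t` is a non-positive log-decay vector over the `head_f_dim` axis are illustrative choices rather than details taken from the released code.

```python
import numpy as np

def hgrn2_scan(q, k, v, g, state=None):
    """Naive reference scan for one head (illustrative, not the released kernel).

    q, k, g: [T, F] arrays (F = head_f_dim); v: [T, I] array (I = head_i_dim).
    g holds log-decays, assumed <= 0 so that exp(g) lies in (0, 1].
    Returns the outputs [T, I] and the final state [F, I].
    """
    T, F = k.shape
    I = v.shape[1]
    S = np.zeros((F, I), dtype=np.float32) if state is None else state.copy()
    out = np.empty((T, I), dtype=np.float32)
    for t in range(T):
        # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
        S = np.exp(g[t])[:, None] * S + np.outer(k[t], v[t])
        # o_t = q_t S_t
        out[t] = q[t] @ S
    return out, S

rng = np.random.default_rng(0)
T, F, I = 8, 4, 4  # toy sizes; the model uses F = I = 128
q, k = rng.normal(size=(2, T, F)).astype(np.float32)
v = rng.normal(size=(T, I)).astype(np.float32)
g = -np.abs(rng.normal(size=(T, F))).astype(np.float32)

# One pass over the whole sequence.
full_out, full_state = hgrn2_scan(q, k, v, g)

# Two chunks with the state carried across the boundary.
out_a, state_a = hgrn2_scan(q[:5], k[:5], v[:5], g[:5])
out_b, state_b = hgrn2_scan(q[5:], k[5:], v[5:], g[5:], state=state_a)
```

Carrying `state_a` into the second call reproduces the single-pass result, which is the property that lets decoding keep only a fixed `[F, I]` state per head instead of the full history.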
Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
Memory: recurrent state vs Transformer KV cache
For batch size 1 in fp16 the recurrent state is constant:
layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB
independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache would grow linearly (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). This is a cache-state comparison only, not a claim about total memory or usable context.
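The figures above can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: the comparison Transformer is taken to have the same depth (16 layers) and hidden size (768) with full multi-head attention, so it stores one key and one value vector of width 768 per layer per token. The function names are hypothetical.

```python
MIB = 1024 ** 2

def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128, head_i_dim=128, bytes_per_el=2):
    """Recurrent state size for batch size 1; does not depend on sequence length."""
    return layers * heads * head_f_dim * head_i_dim * bytes_per_el

def kv_cache_bytes(tokens, layers=16, hidden=768, bytes_per_el=2):
    """KV cache size for an assumed same-depth Transformer (keys + values, no GQA)."""
    return 2 * layers * hidden * bytes_per_el * tokens

state_mib = recurrent_state_bytes() / MIB
for n in (1024, 2048, 16384):
    kv_mib = kv_cache_bytes(n) / MIB
    print(f"{n:>6} tokens: state {state_mib:.1f} MiB, KV cache {kv_mib:.0f} MiB")
```

Passing `bytes_per_el=4` to `recurrent_state_bytes` gives the fp32 figure for the pure-PyTorch fallback.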
Training snapshot
| Field | Value |
|---|---|
| Data | Dolmino mix (~101.6B tokens, odin-32k tokenizer) |
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm 7.13 |
| Interconnect | Thunderbolt 4, DDP over gloo |
| Precision | fp16 + GradScaler |
| Optimizers | NorMuon (2D tensors) + AdamW (1D / embeddings) |
| LR | peak 8e-4, warmup, cosine decay |
| Stabilization | z-loss 1e-4, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | Phase 1: Token-Superposition Training (bag-size 4) + DiffusionBlocks (block-wise) for ~24K steps; Phase 2: standard end-to-end autoregressive recovery |
| Released weights | main = ema_state_dict; live = raw online weights |
The two-phase curriculum trains most of the budget under a block-wise DiffusionBlocks + token-superposition objective for throughput, then recovers ordinary left-to-right generation with a standard end-to-end phase. The released weights are from the end-to-end recovery phase and produce coherent continuations.
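For readers unfamiliar with EMA-shadowed weights, the standard update is sketched below with the decay of 0.999 listed in the table. How the training script applies it (every step or at an interval, with or without warmup) is not stated here, so the per-step loop and the function name `ema_update` are assumptions.

```python
def ema_update(ema_state, live_state, decay=0.999):
    """In-place EMA shadow update: ema = decay * ema + (1 - decay) * live."""
    for name, live_value in live_state.items():
        ema_state[name] = decay * ema_state[name] + (1.0 - decay) * live_value
    return ema_state

# Toy example: scalar "weights" stand in for tensors.
live = {"w": 1.0}
ema = {"w": 0.0}
for _ in range(1000):
    ema_update(ema, live)
print(round(ema["w"], 3))
```

With decay 0.999 the shadow averages over a horizon of roughly 1 / (1 - 0.999) = 1,000 updates, which smooths out step-to-step noise in the live weights.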
Data & curation
Pretraining used the Dolmino mix
(allenai/dolma3_dolmino_mix-100B-1025),
curated by dropping the synthetic / noisy partitions and keeping the natural
text + code:
- Excluded: all synthetic reasoning-trace subsets (Gemini / QwQ / R1 / OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite, verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
- Kept: natural web text, code (stack-edu, cranecode; FIM markers stripped), math, and reference text — the mix's native proportions minus the exclusions.
- Tokenizer: custom 32K BPE (odin-32k); ~101.6B tokens after tokenization.
How we accelerated pretraining
Pretraining ran on two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5) mini-PCs (128 GB unified LPDDR5X each), over Thunderbolt 4 with DDP on the gloo backend. Three techniques compounded:
- TST — Token Superposition Training (bag-size 4): each position is the mean of 4 stochastic sub-word tokenizations of the same text, so the model digests ~4× the tokens per step; the bag size anneals 4 → 2 → 1 over training.
- DiffusionBlocks (B=4): the 16 layers form 4 four-layer blocks trained to denoise their input, block-parallel across the two machines with essentially no gradient all-reduce (Machine A: blocks 1–2; Machine B: blocks 3–4) — ideal for a single Thunderbolt link.
- Two-machine DDP over TB4: unified memory lets gloo keep pace, and the block independence hides the modest interconnect bandwidth.
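The layer-to-block-to-machine split described in the DiffusionBlocks item can be sketched as follows. It assumes the four blocks are contiguous runs of layers in order (block 1 = layers 0 to 3, and so on), which the text implies but does not state; the helper names are hypothetical.

```python
NUM_LAYERS = 16
NUM_BLOCKS = 4
LAYERS_PER_BLOCK = NUM_LAYERS // NUM_BLOCKS

def block_of(layer_idx):
    """1-based block index for a 0-based layer index (contiguous blocks assumed)."""
    return layer_idx // LAYERS_PER_BLOCK + 1

def machine_of(block_idx):
    """Machine A trains blocks 1-2, Machine B trains blocks 3-4."""
    return "A" if block_idx <= 2 else "B"

# Collect the layers each machine is responsible for.
assignment = {}
for layer in range(NUM_LAYERS):
    assignment.setdefault(machine_of(block_of(layer)), []).append(layer)
print(assignment)
```

Because each block is trained against its own denoising objective, neither machine needs gradients for the other machine's layers during this phase.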
Together this phase trained roughly 10–20× faster than a conventional end-to-end autoregressive pass on the same two machines (and far faster than a single accelerator) — which is what made a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter standard end-to-end phase then restores ordinary generation; the released weights (EMA, decay 0.999) come from it.
Results
Zero-shot, our own harness (scripts/eval_benchmarks.py; HellaSwag = acc_norm,
ARC = mean of Easy + Challenge acc, PIQA = acc). Other rows are as reported by
Axiomic Labs on the GPT-X2-125M
card and are not perfectly comparable (different harness).
| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Meta | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Base | 33.05% | 34.29% | 58.81% | 101.6B |
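OdinNext lands in the GPT-2 / OPT / Pythia / GPT-Neo tier here: it edges past those models on HellaSwag and ARC, while its PIQA score (58.81%) is closer to Pythia-160M than to GPT-2. It remains below the
SmolLM / GPT-X2 frontier, but was trained entirely on two consumer AMD mini-PCs.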
An instruction-tuned variant is at
joelhenwang/OdinNext-138M-Instruct.
What this model is good for
- Text continuation and completion in English.
- Research on compact recurrent / linear-attention LMs and fixed-state decoding.
- A base for instruction tuning, alignment, and context extension.
Do not use it for chat / instruction following (this checkpoint is not instruction-tuned), safety-sensitive generation, or benchmark claims without running your own evaluation.
Usage
pip install "transformers>=4.46" torch safetensors
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "joelhenwang/OdinNext-138M-Base"
revision = "main" # EMA weights; pin a commit for reproducibility
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo, revision=revision, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()
prompt = "The discovery of penicillin"
inputs = tok(prompt, return_tensors="pt").to(device)
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max(0, min(100, remaining)),
        do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id, use_cache=True,
    )
print(tok.decode(out[0], skip_special_tokens=True))
Batching guidance
The recurrent scan does not apply an attention mask. For correct batched generation: avoid left padding, prefer same-length prompts, and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
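One way to follow the same-length advice is to bucket tokenized prompts by length so that every batch can be stacked without padding. The sketch below is a hypothetical helper, not part of the repository; it works on plain lists of token ids and leaves the actual `generate` call to the caller.

```python
from collections import defaultdict

def bucket_by_length(token_id_lists, max_batch_size=8):
    """Group tokenized prompts into equal-length batches so no padding is needed.

    Returns a list of batches; each batch is a list of (original_index, token_ids).
    """
    by_length = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        by_length[len(ids)].append((idx, ids))
    batches = []
    for length in sorted(by_length):
        group = by_length[length]
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches

# Toy token ids; in practice these would come from tok(prompt).input_ids.
prompts = [[5, 9, 2], [7, 1], [3, 3, 8], [4, 6]]
batches = bucket_by_length(prompts, max_batch_size=2)
for batch in batches:
    print([idx for idx, _ in batch], "length", len(batch[0][1]))
```

The original indices are kept so that outputs can be put back in input order, and each batch can still be checked against single-sample generation as recommended above.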
Limitations
- Base model only: no instruction tuning, alignment, or chat template.
- No safety training: outputs can be biased, false, or incoherent.
- Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
- attention_mask ignored in the backbone; padding affects recurrent state.
- English-focused; multilingual / code ability is uncharacterized.
- Benchmarks above are zero-shot on our own harness and not perfectly comparable across tooling — run your own evaluation.
Revisions
- main: EMA-shadowed weights (decay 0.999), recommended for evaluation.
- live: raw training weights at the same step.
Pin a commit hash rather than a moving branch for reproducible experiments.
Citation
@misc{odinnext_138m_base_2026,
title = {OdinNext-138M-Base},
author = {Wang, Joel},
year = {2026},
howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Base}},
note = {138M HGRN2 recurrent language-model base checkpoint}
}
References
- Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
- Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
- Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
- Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.