Instructions to use joelhenwang/OdinNext-138M-Base with Transformers, vLLM, SGLang, and Docker.

Libraries

How to use joelhenwang/OdinNext-138M-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="joelhenwang/OdinNext-138M-Base", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("joelhenwang/OdinNext-138M-Base", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use joelhenwang/OdinNext-138M-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "joelhenwang/OdinNext-138M-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/joelhenwang/OdinNext-138M-Base

SGLang

How to use joelhenwang/OdinNext-138M-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "joelhenwang/OdinNext-138M-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "joelhenwang/OdinNext-138M-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

OdinNext-138M-Base

OdinNext is a 138M-parameter causal language model that replaces softmax self-attention with an HGRN2-style gated linear recurrence. This repository is the base pretrained model — trained from scratch on ~101.6B tokens of curated data (the Dolmino mix) on two AMD Strix Halo (gfx1151) machines.

This is a base model: it completes and continues text. It is not an instruction-tuned or chat model — no SFT, DPO, RLHF, or chat template. An instruction-tuned variant is available at joelhenwang/OdinNext-138M-Instruct.

Repo: joelhenwang/OdinNext-138M-Base
main: EMA-shadowed weights (decay 0.999), recommended.
live: raw training weights at the same step.
Context window: 2,048 tokens in the released inference code.
License: Apache-2.0.

Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.

At a glance

Item	Value
Unique tied parameters	138,449,696
Non-embedding parameters	113,283,872
Layers	16
Hidden size	768
Heads	6
Head state dims	128 × 128 per head
FFN inner size	2,048
Vocabulary	32,768 custom BPE tokens
Max sequence length	2,048
Checkpoint dtype	fp16
Architecture	HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm
Cache type	Fixed-size recurrent state, not a growing KV cache

Architecture

Decoder-only causal LM built from 16 pre-norm blocks that share the same structure (they differ only in whether RoPE is applied to q/k):

x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(ZCRMSNorm(x))

The HGRN2 recurrent state updates per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] = [B, 6, 128, 128]. This state is constant in size with respect to context length, giving O(1)-per-token decoding rather than a growing KV cache.
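
The recurrence above can be sketched for a single head with plain Python lists. This is an illustration of the two equations only, not the repo's kernel: the function name is hypothetical, the state rows are assumed to index the key axis (so `diag(exp(g_t))` scales rows), and `g_t` is assumed to hold log-decays ≤ 0.

```python
import math

def hgrn2_step(state, q, k, v, g):
    """One recurrent step for a single head.

    state: d_k x d_v matrix as a list of rows; q, k, g: length d_k; v: length d_v.
    Implements S_t = diag(exp(g_t)) S_{t-1} + k_t (outer) v_t and o_t = q_t S_t.
    """
    d_k, d_v = len(k), len(v)
    new_state = [
        [math.exp(g[i]) * state[i][j] + k[i] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out

# Tiny 2 x 2 example; the released model uses 128 x 128 per head.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = hgrn2_step(state, q=[1.0, 1.0], k=[1.0, 0.0], v=[2.0, 3.0], g=[0.0, 0.0])
state, out2 = hgrn2_step(state, q=[1.0, 0.0], k=[0.0, 1.0], v=[1.0, 1.0],
                         g=[math.log(0.5), 0.0])
print(out1, out2)
```

The state keeps the same shape after every call, which is why decoding cost per token does not depend on how many tokens came before.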

Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
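
The alternating scheme can be sketched as a layer predicate plus the usual RoPE frequency table. The rotary dimension of 128 (the q/k head size) and the standard `theta ** (-2j / d)` frequency formula are assumptions; the repo's code may pair dimensions differently.

```python
import math

N_LAYERS = 16
HEAD_DIM = 128          # assumed rotary dimension (q/k head size)
ROPE_THETA = 100_000.0  # θ from the model card

def uses_rope(layer_idx):
    """Even layers rotate q/k; odd layers are position-free."""
    return layer_idx % 2 == 0

def rope_inverse_frequencies(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Standard RoPE inverse frequencies, one per rotated pair of dimensions."""
    return [theta ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]

def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

rope_layers = [i for i in range(N_LAYERS) if uses_rope(i)]
inv_freq = rope_inverse_frequencies()
print(rope_layers)
print(len(inv_freq), inv_freq[0])
```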

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16 the recurrent state is constant:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB

independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache, assuming 16 layers, hidden size 768, and full keys plus values per layer, would grow linearly at 48 KiB per token (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). This is a cache-state comparison only, not a claim about total memory or usable context.
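
The figures in this comparison can be reproduced with a few lines of arithmetic. The comparison Transformer is assumed to use full multi-head attention with hidden size 768 (no grouped-query sharing), which is what the quoted numbers imply; the function names are illustrative.

```python
MIB = 1024 ** 2

def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128, head_i_dim=128,
                          bytes_per_elem=2):
    """Size of the fixed recurrent state for batch size 1."""
    return layers * heads * head_f_dim * head_i_dim * bytes_per_elem

def kv_cache_bytes(tokens, layers=16, hidden_size=768, bytes_per_elem=2):
    """Size of a growing KV cache: keys and values for every layer and token."""
    return 2 * layers * hidden_size * tokens * bytes_per_elem

print(recurrent_state_bytes() / MIB)                  # 3.0 (fp16)
print(recurrent_state_bytes(bytes_per_elem=4) / MIB)  # 6.0 (fp32 fallback)
for tokens in (1024, 16 * 1024):
    print(tokens, kv_cache_bytes(tokens) / MIB)       # 48.0, then 768.0
```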

Training snapshot

Field	Value
Data	Dolmino mix (~101.6B tokens, odin-32k tokenizer)
Hardware	2× AMD Strix Halo / gfx1151, ROCm 7.13
Interconnect	Thunderbolt 4, DDP over gloo
Precision	fp16 + GradScaler
Optimizers	NorMuon (2D tensors) + AdamW (1D / embeddings)
LR	peak 8e-4, warmup, cosine decay
Stabilization	z-loss 1e-4, attention soft-cap 50, EMA decay 0.999
Curriculum	Phase 1: Token-Superposition Training (bag-size 4) + DiffusionBlocks (block-wise) for ~24K steps; Phase 2: standard end-to-end autoregressive recovery
Released weights	`main` = `ema_state_dict`; `live` = raw online weights
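
The schedule and stabilization entries in the table correspond to well-known formulas. The sketch below uses their common textbook forms (linear warmup into cosine decay, a tanh soft-cap, and a squared log-partition penalty); the warmup length, total step count, and minimum LR are hypothetical, and the training code may define these terms differently.

```python
import math

PEAK_LR = 8e-4      # from the table
Z_LOSS_COEF = 1e-4  # from the table
SOFT_CAP = 50.0     # from the table

def lr_at(step, total_steps, warmup_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def soft_cap(x, cap=SOFT_CAP):
    """Smoothly bound a score to (-cap, cap); near zero it is almost the identity."""
    return cap * math.tanh(x / cap)

def z_loss(logits, coef=Z_LOSS_COEF):
    """Penalty coef * (log sum exp(logits))^2, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coef * log_z ** 2

# Hypothetical step counts, for illustration only.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100))
print(soft_cap(1.0), soft_cap(1e6))
print(z_loss([0.0, 0.0]))
```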

The two-phase curriculum trains most of the budget under a block-wise DiffusionBlocks + token-superposition objective for throughput, then recovers ordinary left-to-right generation with a standard end-to-end phase. The released weights are from the end-to-end recovery phase and produce coherent continuations.

Data & curation

Pretraining used the Dolmino mix (allenai/dolma3_dolmino_mix-100B-1025), curated by dropping the synthetic / noisy partitions and keeping the natural text + code:

Excluded: all synthetic reasoning-trace subsets (Gemini / QwQ / R1 / OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite, verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
Kept: natural web text, code (stack-edu, cranecode; FIM markers stripped), math, and reference text — the mix's native proportions minus the exclusions.
Tokenizer: custom 32K BPE (odin-32k); ~101.6B tokens after tokenization.

How we accelerated pretraining

Pretraining ran on two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5) mini-PCs (128 GB unified LPDDR5X each), over Thunderbolt 4 with DDP on the gloo backend. Three techniques compounded:

TST — Token Superposition Training (bag-size 4): each position is the mean of 4 stochastic sub-word tokenizations of the same text, so the model digests ~4× the tokens per step; the bag size anneals 4 → 2 → 1 over training.
DiffusionBlocks (B=4): the 16 layers form 4 four-layer blocks trained to denoise their input, block-parallel across the two machines with essentially no gradient all-reduce (Machine A: blocks 1–2; Machine B: blocks 3–4) — ideal for a single Thunderbolt link.
Two-machine DDP over TB4: unified memory lets gloo keep pace, and the block independence hides the modest interconnect bandwidth.
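
The block layout described in the list above can be written out explicitly. This sketch covers only the partitioning of layers into blocks and machines, not the denoising objective or the token-superposition input; the function names are hypothetical and layers are 0-indexed.

```python
def partition_into_blocks(n_layers=16, n_blocks=4):
    """Split consecutive layer indices into equally sized blocks."""
    if n_layers % n_blocks != 0:
        raise ValueError("n_layers must be divisible by n_blocks")
    size = n_layers // n_blocks
    return [list(range(b * size, (b + 1) * size)) for b in range(n_blocks)]

def assign_blocks(blocks, machines=("A", "B")):
    """Give each machine a contiguous run of blocks."""
    if len(blocks) % len(machines) != 0:
        raise ValueError("blocks must divide evenly across machines")
    per_machine = len(blocks) // len(machines)
    return {
        name: blocks[i * per_machine:(i + 1) * per_machine]
        for i, name in enumerate(machines)
    }

blocks = partition_into_blocks()
placement = assign_blocks(blocks)
for name, owned in placement.items():
    print(name, owned)
```

With the defaults, machine A holds layers 0 to 7 (blocks 1 and 2) and machine B holds layers 8 to 15 (blocks 3 and 4).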

Together this phase trained roughly 10–20× faster than a conventional end-to-end autoregressive pass on the same two machines (and far faster than a single accelerator) — which is what made a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter standard end-to-end phase then restores ordinary generation; the released weights (EMA, decay 0.999) come from it.

Results

The OdinNext row is zero-shot, measured with our own harness (scripts/eval_benchmarks.py): HellaSwag is acc_norm, ARC is the mean of Easy and Challenge acc, and PIQA is acc. The other rows are as reported by Axiomic Labs on the GPT-X2-125M card and are not strictly comparable because they come from a different harness.

Company	Model	HellaSwag	ARC (avg)	PIQA	Training tokens
HuggingFace	SmolLM2-135M	43.22%	44.62%	67.52%	2T
Axiomic Labs	GPT-X2-125M	40.55%	39.90%	66.97%	75B
OpenAI	GPT-2 (124M)	31.49%	31.40%	63.28%	~10B
EleutherAI	Pythia-160M	30.46%	29.95%	57.94%	~225B
Facebook	OPT-125M	31.39%	31.53%	62.02%	180B
EleutherAI	GPT-Neo-125M	30.55%	31.43%	61.75%	300B
This work	OdinNext-138M-Base	33.05%	34.29%	58.81%	101.6B
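
The metric names used for this table can be illustrated with a small sketch. It assumes acc picks the answer with the highest total log-likelihood and acc_norm divides each log-likelihood by the continuation's length in bytes, a common convention; scripts/eval_benchmarks.py may normalize differently. All scores and strings below are made up for illustration.

```python
def pick_acc(loglikelihoods):
    """Index of the choice with the highest raw log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])

def pick_acc_norm(loglikelihoods, continuations):
    """Index of the choice with the highest length-normalized log-likelihood."""
    return max(
        range(len(loglikelihoods)),
        key=lambda i: loglikelihoods[i] / len(continuations[i].encode("utf-8")),
    )

def arc_average(easy_acc, challenge_acc):
    """Unweighted mean of the two ARC subsets."""
    return (easy_acc + challenge_acc) / 2

# Hypothetical item: the longer answer has a lower total score but a better per-byte score.
continuations = ["yes", "a much longer ending"]
loglikelihoods = [-6.0, -10.0]
print(pick_acc(loglikelihoods))                      # 0
print(pick_acc_norm(loglikelihoods, continuations))  # 1
```

The two selection rules can disagree on the same item, which is one reason numbers from different harnesses are hard to compare directly.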

OdinNext lands in the GPT-2 / OPT / Pythia / GPT-Neo tier here: ahead of all four on HellaSwag and ARC, behind GPT-2, OPT, and GPT-Neo on PIQA, and below the SmolLM2 / GPT-X2 frontier on every column. It was trained entirely on two consumer AMD mini-PCs. An instruction-tuned variant is at joelhenwang/OdinNext-138M-Instruct.

What this model is good for

Text continuation and completion in English.
Research on compact recurrent / linear-attention LMs and fixed-state decoding.
A base for instruction tuning, alignment, and context extension.

Do not use it for chat or instruction following (this checkpoint is not instruction-tuned), safety-sensitive generation, or benchmark claims without running your own evaluation.

Usage

pip install "transformers>=4.46" torch safetensors

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Base"
revision = "main"  # EMA weights; pin a commit for reproducibility

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo, revision=revision, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()

prompt = "The discovery of penicillin"
inputs = tok(prompt, return_tensors="pt").to(device)
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max(0, min(100, remaining)),
        do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id, use_cache=True,
    )
print(tok.decode(out[0], skip_special_tokens=True))
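
The `remaining` clamp in the snippet above keeps prompt plus generated tokens within the 2,048-position limit. A standalone version of that arithmetic is sketched below; the helper name is hypothetical, and raising when no budget is left (rather than requesting zero new tokens) is a choice made for this sketch, not behavior taken from the repo.

```python
def new_token_budget(prompt_len, requested=100, max_positions=2048):
    """Return how many new tokens fit after a prompt of prompt_len tokens."""
    remaining = max_positions - prompt_len
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_len} of {max_positions} positions; "
            "shorten the prompt before generating"
        )
    return min(requested, remaining)

print(new_token_budget(10))    # 100
print(new_token_budget(2000))  # 48
```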

Batching guidance

The recurrent scan does not apply an attention mask, so pad tokens are folded into the recurrent state like any other token. For correct batched generation: avoid left padding, prefer same-length prompts, and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
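
One way to follow the same-length advice is to group tokenized prompts by length so that no batch needs padding. The sketch below works on plain lists of token ids and does not call the model; the function names and the batch-size limit are hypothetical.

```python
from collections import defaultdict

def bucket_by_length(token_id_lists):
    """Map each prompt length to the indices of the prompts with that length."""
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append(idx)
    return dict(buckets)

def batches_without_padding(token_id_lists, max_batch_size=8):
    """Yield (indices, prompts) batches in which every prompt has the same length."""
    for _, indices in sorted(bucket_by_length(token_id_lists).items()):
        for start in range(0, len(indices), max_batch_size):
            chunk = indices[start:start + max_batch_size]
            yield chunk, [token_id_lists[i] for i in chunk]

# Made-up token ids for illustration.
prompts = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
for indices, batch in batches_without_padding(prompts):
    print(indices, batch)
```

The returned indices let you restore the original prompt order after generation.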

Limitations

Base model only: no instruction tuning, alignment, or chat template.
No safety training: outputs can be biased, false, or incoherent.
Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
attention_mask ignored in the backbone; padding affects recurrent state.
English-focused; multilingual / code ability is uncharacterized.
Benchmarks above are zero-shot on our own harness and not perfectly comparable across tooling — run your own evaluation.

Revisions

main: EMA-shadowed weights (decay 0.999), recommended for evaluation.
live: raw training weights at the same step.
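
The relationship between the two revisions can be sketched as a standard exponential moving average over the live weights. This is the textbook update with decay 0.999, shown on a dictionary of scalars; the function name is hypothetical and the training code may apply the update on a different cadence.

```python
def ema_update(ema_weights, live_weights, decay=0.999):
    """Move each EMA weight a fraction (1 - decay) toward the live weight."""
    for name, value in live_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights

# Made-up scalar "weights" for illustration.
ema = {"w": 1.0}
live = {"w": 2.0}
ema_update(ema, live)
print(ema["w"])
```

With decay 0.999 the average spans roughly the last 1 / (1 - 0.999) = 1,000 updates, which smooths out step-to-step noise in the live weights.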

Pin a commit hash rather than a moving branch for reproducible experiments.

Citation

@misc{odinnext_138m_base_2026,
  title        = {OdinNext-138M-Base},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Base}},
  note         = {138M HGRN2 recurrent language-model base checkpoint}
}

References

Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.