DeseretLM-200M

A 209.7M-parameter chat model trained from scratch on synthetic text written exclusively in the Deseret Alphabet — a 19th-century phonetic writing system for English. The model produces and consumes Deseret-only text; English is auto-translated at the input boundary.

Total training cost: ~$45 USD on a single H100.

What this is

This is, to our knowledge, the first language model trained from scratch in the Deseret Alphabet. It demonstrates that a phonetic-orthography-only LLM can be built end-to-end on a hobby budget by:

Reverse-engineering a translation pipeline (English → Deseret) and validating it against the 1869 Book of Mormon (99.965 % parity).
Synthesizing a ~11 B-token pre-training corpus from FineWeb-Edu.
Synthesizing a 200 k-conversation chat corpus from UltraChat.
Training a Llama-style transformer for ~12 hours on a single H100.

Architecture

Standard decoder-only transformer with modern components:

Hyperparameter	Value
Parameters	209,716,224
Layers	16
Hidden size (d_model)	1024
Attention heads	16
MLP intermediate (SwiGLU)	2730
Vocab size	8,192
Context length	1024
Normalization	RMSNorm
Positional encoding	RoPE (base 10000)
Tied embeddings	yes
Activation	SwiGLU
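
The parameter total in the table can be reproduced from the other rows. This is a minimal sketch, not the author's code. It assumes bias-free linear layers, four d_model × d_model attention projections, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer plus one final norm, and an output head that shares the embedding matrix. None of these details are stated in the card. The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab=8192, d_model=1024, n_layers=16, d_ff=2730):
    """Count weights under the assumptions stated above (no biases, tied head)."""
    embedding = vocab * d_model          # shared with the output head (tied)
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    mlp = 3 * d_model * d_ff             # SwiGLU: gate, up and down matrices
    norms = 2 * d_model                  # one RMSNorm before attention, one before MLP
    per_layer = attention + mlp + norms
    final_norm = d_model                 # RMSNorm before the output head
    return embedding + n_layers * per_layer + final_norm

total = count_parameters()
print(f"{total:,}")  # 209,716,224
```

Under these assumptions the count matches the reported 209,716,224 exactly, which suggests (but does not prove) that the layout above is the one used.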

Tokenizer

Byte-level BPE with 8k vocab, trained on the full Deseret corpus. Special tokens at IDs 0–5: <|pad|> <|bos|> <|eos|> <|user|> <|assistant|> <|system|>.

Get the tokenizer at chrisjpatty/deseret-8k-bpe.

Chat template

<|bos|> <|user|> {user content tokens} <|assistant|> {assistant content tokens} <|eos|>

Multi-turn: repeat the user/assistant pair. Loss during SFT was computed only on assistant tokens + the terminal <|eos|>.
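
The template and loss mask described above can be sketched as follows. This is an illustration under stated assumptions, not the training code: the special tokens are assumed to take IDs 0 to 5 in the order listed in the Tokenizer section, a single `<|eos|>` is assumed to close the whole conversation, and the `<|assistant|>` marker itself is assumed to be excluded from the loss. The helper name `build_sft_example` is hypothetical.

```python
# Assumed ID order: <|pad|> <|bos|> <|eos|> <|user|> <|assistant|> <|system|>
PAD, BOS, EOS, USER, ASSISTANT, SYSTEM = range(6)

def build_sft_example(turns):
    """turns: list of (user_ids, assistant_ids) pairs of already-tokenized content.

    Returns (input_ids, loss_mask); the mask is 1 on tokens that count as
    prediction targets (assistant content and the terminal <|eos|>), else 0.
    """
    ids, mask = [BOS], [0]
    for user_ids, assistant_ids in turns:
        ids += [USER] + list(user_ids)
        mask += [0] * (1 + len(user_ids))        # no loss on user turn
        ids += [ASSISTANT] + list(assistant_ids)
        mask += [0] + [1] * len(assistant_ids)   # loss on assistant content only
    ids.append(EOS)                              # single terminal <|eos|> (assumed)
    mask.append(1)
    return ids, mask

# Two turns with made-up content IDs (10+ stand in for real BPE tokens).
ids, mask = build_sft_example([([10, 11], [20, 21, 22]), ([12], [23])])
print(ids)   # [1, 3, 10, 11, 4, 20, 21, 22, 3, 12, 4, 23, 2]
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
```

In real use the IDs should be looked up with `tok.token_to_id(...)`, as the Usage section does, rather than hard-coded.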

Training data

Stage	Dataset	Size	License
Pre-training	chrisjpatty/fineweb-edu-deseret	11.13 B tokens	ODC-By 1.0
SFT	chrisjpatty/ultrachat-deseret	200 k conversations	MIT

Training recipe

Pre-training (~$40, ~12 hr on 1× H100 80GB SXM):

Optimizer: AdamW (β=(0.9, 0.95), wd=0.1, eps=1e-8, fused)
LR: 3e-4 peak, cosine decay to 3e-5
Warmup: 2000 steps
Batch: 32 × grad-accum 16 × ctx 1024 = 524 288 tokens/step
Steps: 20 000 (~10.5 B tokens, ~50 tokens/parameter, well past the Chinchilla-optimal ratio of roughly 20 tokens/parameter)
Precision: bf16
Grad clip: 1.0
Gradient norm + parameter norm logged throughout
NaN guard with emergency checkpoint
Final loss: 2.68 train / 2.67 val
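
The token budget and learning-rate schedule listed above can be checked with a short sketch. The arithmetic uses only numbers from the recipe. The schedule shape is an assumption: the card gives the peak, the floor and the warmup length, but not whether warmup is linear or whether the cosine decay spans exactly the remaining steps. The function name `lr_at` is hypothetical.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 20_000

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at MIN_LR past the final step
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Token budget: micro-batch × gradient accumulation × context length.
tokens_per_step = 32 * 16 * 1024
total_tokens = tokens_per_step * TOTAL_STEPS
tokens_per_param = total_tokens / 209_716_224

print(tokens_per_step)             # 524288
print(total_tokens)                # 10485760000
print(round(tokens_per_param, 1))  # 50.0
```

The 20 000 steps consume about 10.5 B of the 11.13 B tokens in the corpus, so pre-training stops slightly short of one full pass over the data.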

SFT (~$5, ~33 min on same pod):

LR: 1e-5 peak, cosine decay to 1e-6
Warmup: 200 steps
Batch: 16 × grad-accum 4 × max_len 1024
1 epoch over 200k UltraChat-Deseret conversations
Loss only on assistant tokens
Final loss: 1.58
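
For a sense of scale, the SFT settings above imply the following step count. This is back-of-the-envelope arithmetic, assuming exactly 200 000 conversations and one conversation per sequence with no packing. The card confirms neither.

```python
import math

conversations = 200_000  # "200k" in the recipe; the exact count is assumed
micro_batch, grad_accum, max_len = 16, 4, 1024

conversations_per_step = micro_batch * grad_accum
optimizer_steps = math.ceil(conversations / conversations_per_step)
max_tokens_per_step = conversations_per_step * max_len  # upper bound; short chats use fewer
warmup_fraction = 200 / optimizer_steps

print(conversations_per_step)     # 64
print(optimizer_steps)            # 3125
print(max_tokens_per_step)        # 65536
print(f"{warmup_fraction:.1%}")   # 6.4%
```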

Validation

The translation pipeline that produced the training data was validated against the Illinois Deseret Consortium's parallel transcription of the 1869 Book of Mormon — the authoritative published Deseret text — achieving:

100.00 % parity on the IDC spelling dictionary (1,359 entries)
99.965 % word-level parity on the full 1869 Book of Mormon (108k+ words)
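
One plausible way to compute a word-level parity figure like the ones above is sketched below. This is not the project's validation harness: it assumes the translator output and the reference have already been aligned word for word, and the name `word_parity` is hypothetical.

```python
def word_parity(candidate_words, reference_words):
    """Percentage of positions where two aligned word lists agree exactly."""
    if len(candidate_words) != len(reference_words):
        raise ValueError("word lists must be aligned one-to-one")
    if not reference_words:
        raise ValueError("reference is empty")
    matches = sum(c == r for c, r in zip(candidate_words, reference_words))
    return 100.0 * matches / len(reference_words)

# Toy example: four reference words, one of which the candidate gets wrong.
reference = ["𐐐𐐲𐑊𐐬", "𐐸𐐭", "𐐪𐑉", "𐐷𐐭"]
candidate = ["𐐐𐐲𐑊𐐬", "𐐸𐐭", "𐐪𐑉", "𐐸𐐭"]
print(word_parity(candidate, reference))  # 75.0

# What 99.965 % implies on a text of about 108 000 words.
implied_mismatches = round(108_000 * (100 - 99.965) / 100)
print(implied_mismatches)  # 38
```

At the reported parity, roughly 38 words in the full text differ from the reference. The true figure may be slightly higher, since the card gives the length only as "108k+".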

The model itself was not benchmarked against standard NLP evals (these don't exist for Deseret).

Usage

import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
# Use the model code from https://github.com/chrisjpatty/deseretlm  (or vendor model/transformer.py)
from model.transformer import Transformer, TransformerConfig

# Download files
ckpt_path = hf_hub_download(repo_id="chrisjpatty/deseretlm-200m", filename="final.pt")
tok_path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")

# Load
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
ckpt = torch.load(ckpt_path, map_location=device)
cfg = TransformerConfig(**ckpt["cfg"])
model = Transformer(cfg).to(device).eval()
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})
tok = Tokenizer.from_file(tok_path)

# Build prompt with chat template
bos, eos = tok.token_to_id("<|bos|>"), tok.token_to_id("<|eos|>")
u, a = tok.token_to_id("<|user|>"), tok.token_to_id("<|assistant|>")
prompt_des = "𐐐𐐲𐑊𐐬, 𐐸𐐭 𐐪𐑉 𐐷𐐭?"   # "Hello, who are you?"
ids = [bos, u] + tok.encode(prompt_des).ids + [a]
x = torch.tensor([ids], dtype=torch.long, device=device)

# Generate
with torch.no_grad():
    for _ in range(256):
        logits, _ = model(x)
        next_id = int(torch.multinomial(torch.softmax(logits[0, -1] / 0.7, -1), 1).item())
        x = torch.cat([x, torch.tensor([[next_id]], device=device)], dim=1)
        if next_id == eos:
            break

reply = tok.decode(x[0, len(ids):].tolist())
print(reply)

Known limitations

This is a small model trained from scratch on a tight budget. Expect:

Coherent Deseret, sometimes off-topic answers. The language is fluent but instruction-following is weak. E.g., asked "who are you?" the model may answer about music players. It learned the chat format well; it learned what to say less well.
No factual knowledge guarantees at this scale. A 200M-parameter model trained on ~10.5 B tokens has limited capacity for storing facts.
Modern American English phonology, not 1860s New England. Words like "ask" are rendered /æsk/ (modern) not /ɑsk/ (period). The translator is internally consistent but stylistically modern.
No safety tuning, no RLHF/DPO. Single SFT pass over UltraChat-200k only.
Limited multi-turn coherence beyond ~3 turns.

Reproduction

Full code, including the translator, training scripts, and validation harness, lives in the project repo at https://github.com/chrisjpatty/deseretlm. Total compute budget: under $50 on a rented H100. Wall-clock time: ~24 hours, including data prep on a single Mac.

Citation

DeseretLM-200M: a small language model trained from scratch in the Deseret Alphabet.
Christopher Patty (chrisjpatty), 2026.

If you build on the datasets, please also cite the source corpora — see the dataset cards for fineweb-edu-deseret and ultrachat-deseret.
