DeseretLM-200M
A 209.7M-parameter chat model trained from scratch on synthetic text written exclusively in the Deseret Alphabet, a 19th-century phonetic writing system for English. The model produces and consumes Deseret-only text; English is auto-translated at the input boundary.
Total training cost: ~$45 USD on a single H100.
What this is
This is, to our knowledge, the first language model trained from scratch in the Deseret Alphabet. It demonstrates that a phonetic-orthography-only LLM can be built end-to-end on a hobby budget by:
- Reverse-engineering an English-to-Deseret translation pipeline and validating it against the 1869 Book of Mormon (99.965 % parity).
- Synthesizing a ~11 B-token pre-training corpus from FineWeb-Edu.
- Synthesizing a 200 k-conversation chat corpus from UltraChat.
- Training a Llama-style transformer for ~12 hours on a single H100.
Architecture
Standard decoder-only transformer with modern components:
| Hyperparameter | Value |
|---|---|
| Parameters | 209,716,224 |
| Layers | 16 |
| Hidden size (d_model) | 1024 |
| Attention heads | 16 |
| MLP intermediate (SwiGLU) | 2730 |
| Vocab size | 8,192 |
| Context length | 1024 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (base 10000) |
| Tied embeddings | yes |
| Activation | SwiGLU |
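The parameter count in the table can be reproduced from the other hyperparameters. The sketch below is a back-of-the-envelope check, not the repo's code. It assumes bias-free linear layers, full multi-head attention (no grouped-query sharing), two RMSNorm weight vectors per layer plus one final norm, and an output head that shares the embedding matrix. Under those assumptions the tally matches the reported figure.

```python
def count_params(vocab=8192, d_model=1024, n_layers=16, d_ff=2730):
    """Tally weights for a Llama-style decoder with tied embeddings.

    Assumptions (not confirmed by the model card): no biases, full
    multi-head attention, two RMSNorms per layer and one final RMSNorm.
    """
    embedding = vocab * d_model          # shared with the output head (tied)
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    mlp = 3 * d_model * d_ff             # SwiGLU: gate, up and down projections
    norms = 2 * d_model                  # pre-attention and pre-MLP RMSNorm
    per_layer = attention + mlp + norms
    final_norm = d_model
    return embedding + n_layers * per_layer + final_norm


total = count_params()
print(f"{total:,}")  # 209,716,224
```

The MLP width of 2730 is roughly 8/3 × 1024, the usual SwiGLU sizing that keeps the three-matrix MLP close to the parameter count of a conventional 4× two-matrix MLP.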
Tokenizer
Byte-level BPE with 8k vocab, trained on the full Deseret corpus. Special tokens at IDs 0-5: <|pad|> <|bos|> <|eos|> <|user|> <|assistant|> <|system|>.
Get the tokenizer at chrisjpatty/deseret-8k-bpe.
Chat template
<|bos|> <|user|> {user content tokens} <|assistant|> {assistant content tokens} <|eos|>
Multi-turn: repeat the user/assistant pair. Loss during SFT was computed only on assistant tokens + the terminal <|eos|>.
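As an illustration of the template and the assistant-only loss, here is a minimal sketch that assembles one training example and its loss mask from already-tokenized turns. The function name `build_example` is hypothetical, the special-token IDs assume the listed tokens occupy IDs 0 to 5 in the order given in the Tokenizer section, and placing a single `<|eos|>` at the very end of a multi-turn conversation is an assumption based on the phrase "terminal `<|eos|>`".

```python
from typing import List, Sequence, Tuple

# Assumed ID order, following the Tokenizer section: pad, bos, eos, user, assistant, system.
PAD, BOS, EOS, USER, ASSISTANT, SYSTEM = range(6)


def build_example(
    turns: Sequence[Tuple[List[int], List[int]]],
) -> Tuple[List[int], List[int]]:
    """Return (token_ids, loss_mask) for a list of (user_ids, assistant_ids) turns.

    loss_mask[i] is 1 where token_ids[i] should contribute to the SFT loss:
    assistant content tokens and the terminal <|eos|>.
    """
    ids: List[int] = [BOS]
    mask: List[int] = [0]
    for user_ids, assistant_ids in turns:
        # User marker, user content and assistant marker are context only.
        ids += [USER] + list(user_ids) + [ASSISTANT]
        mask += [0] * (len(user_ids) + 2)
        # Assistant content is what the model is trained to produce.
        ids += list(assistant_ids)
        mask += [1] * len(assistant_ids)
    ids.append(EOS)
    mask.append(1)
    return ids, mask


ids, mask = build_example([([10, 11], [20, 21, 22])])
print(ids)   # [1, 3, 10, 11, 4, 20, 21, 22, 2]
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The mask here is aligned with the tokens themselves; a next-token training loop would shift it by one position together with the targets.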
Training data
| Stage | Dataset | Size | License |
|---|---|---|---|
| Pre-training | chrisjpatty/fineweb-edu-deseret | 11.13 B tokens | ODC-By 1.0 |
| SFT | chrisjpatty/ultrachat-deseret | 200 k conversations | MIT |
Training recipe
Pre-training (~$40, ~12 hr on 1× H100 80GB SXM):
- Optimizer: AdamW (β=(0.9, 0.95), wd=0.1, eps=1e-8, fused)
- LR: 3e-4 peak, cosine decay to 3e-5
- Warmup: 2000 steps
- Batch: 32 × grad-accum 16 × ctx 1024 = 524 288 tokens/step
- Steps: 20 000 (~10.5 B tokens, ~50 tokens/parameter, well past the ~20 tokens/parameter Chinchilla-optimal ratio)
- Precision: bf16
- Grad clip: 1.0
- Gradient norm + parameter norm logged throughout
- NaN guard with emergency checkpoint
- Final loss: 2.68 train / 2.67 val
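The schedule and token budget above can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the warmup is taken to be linear, and the cosine decay is taken to span the remaining 18 000 steps, neither of which the recipe states explicitly. The function name `lr_at` is illustrative only.

```python
import math


def lr_at(step, peak=3e-4, floor=3e-5, warmup=2000, total=20000):
    """Learning rate at a given optimizer step.

    Assumes linear warmup to `peak`, then cosine decay to `floor`
    over the remaining steps (the recipe does not spell this out).
    """
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Token budget implied by the batch configuration.
tokens_per_step = 32 * 16 * 1024             # micro-batch x grad-accum x context
total_tokens = tokens_per_step * 20_000      # 20 000 optimizer steps
tokens_per_param = total_tokens / 209_716_224

print(tokens_per_step)               # 524288
print(total_tokens)                  # 10485760000
print(round(tokens_per_param, 1))    # 50.0
```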
SFT (~$5, ~33 min on same pod):
- LR: 1e-5 peak, cosine decay to 1e-6
- Warmup: 200 steps
- Batch: 16 × grad-accum 4 × max_len 1024
- 1 epoch over 200k UltraChat-Deseret conversations
- Loss only on assistant tokens
- Final loss: 1.58
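For a rough sense of scale, the SFT settings imply the following step counts. The figures assume exactly 200 000 conversations and one conversation per sequence (no packing), neither of which is stated above, so treat them as estimates.

```python
import math

conversations = 200_000                       # assumed exact; the card says "200 k"
micro_batch, grad_accum, max_len = 16, 4, 1024

sequences_per_step = micro_batch * grad_accum             # conversations per optimizer step
max_tokens_per_step = sequences_per_step * max_len        # upper bound, if every sequence is full
steps_per_epoch = math.ceil(conversations / sequences_per_step)
warmup_fraction = 200 / steps_per_epoch                   # share of the epoch spent warming up

print(sequences_per_step)    # 64
print(max_tokens_per_step)   # 65536
print(steps_per_epoch)       # 3125
print(warmup_fraction)       # 0.064
```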
Validation
The translation pipeline that produced the training data was validated against the Illinois Deseret Consortium's parallel transcription of the 1869 Book of Mormon (the authoritative published Deseret text), achieving:
- 100.00 % parity on the IDC spelling dictionary (1,359 entries)
- 99.965 % word-level parity on the full 1869 Book of Mormon (108k+ words)
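A word-level parity figure of this kind can be computed as the fraction of aligned words on which the translator's output matches the reference exactly. The sketch below is one plausible definition, not the project's validation harness: `word_parity` is a hypothetical name, the two word lists are assumed to be aligned one-to-one already, and the sample words are Latin-script placeholders rather than real Deseret text.

```python
def word_parity(reference, candidate):
    """Fraction of positions where the candidate word equals the reference word.

    Assumes both sequences are already aligned word-for-word.
    """
    if len(reference) != len(candidate):
        raise ValueError("word lists must be aligned one-to-one")
    if not reference:
        raise ValueError("reference is empty")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


# Placeholder data: 3 of 4 words agree.
reference = ["and", "it", "came", "to"]
candidate = ["and", "it", "kame", "to"]
print(word_parity(reference, candidate))  # 0.75

# Scale of the reported result: 99.965 % of roughly 108 000 words
# leaves on the order of 38 mismatching words.
print(round(108_000 * (1 - 0.99965)))  # 38
```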
The model itself was not benchmarked against standard NLP evals (these don't exist for Deseret).
Usage
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
# Use the model code from https://github.com/chrisjpatty/deseretlm (or vendor model/transformer.py)
from model.transformer import Transformer, TransformerConfig
# Download files
ckpt_path = hf_hub_download(repo_id="chrisjpatty/deseretlm-200m", filename="final.pt")
tok_path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")
# Load
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
ckpt = torch.load(ckpt_path, map_location=device)
cfg = TransformerConfig(**ckpt["cfg"])
model = Transformer(cfg).to(device).eval()
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})
tok = Tokenizer.from_file(tok_path)
# Build prompt with chat template
bos, eos = tok.token_to_id("<|bos|>"), tok.token_to_id("<|eos|>")
u, a = tok.token_to_id("<|user|>"), tok.token_to_id("<|assistant|>")
prompt_des = "𐐐𐐲𐑊𐐬, 𐐸𐐭 𐐪𐑉 𐐷𐐭?"  # "Hello, who are you?"
ids = [bos, u] + tok.encode(prompt_des).ids + [a]
x = torch.tensor([ids], dtype=torch.long, device=device)
# Generate
with torch.no_grad():
    for _ in range(256):
        logits, _ = model(x)
        # Sample the next token from the last position at temperature 0.7
        next_id = int(torch.multinomial(torch.softmax(logits[0, -1] / 0.7, -1), 1).item())
        x = torch.cat([x, torch.tensor([[next_id]], device=device)], dim=1)
        if next_id == eos:
            break
reply = tok.decode(x[0, len(ids):].tolist())
print(reply)
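The sampling step in the loop above divides the logits by a temperature of 0.7 before the softmax. The dependency-free sketch below shows the same idea on plain Python lists, together with a small helper that keeps the running sequence within the 1024-token context length. Both function names are hypothetical, and cropping from the left is an assumption about how one might handle long prompts (the loop above only exceeds the context if the prompt is longer than 768 tokens).

```python
import math
import random


def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def crop_to_context(ids, max_len=1024):
    """Keep only the most recent `max_len` token IDs."""
    return ids[-max_len:]


rng = random.Random(0)
# With one logit far above the others, that index dominates the distribution.
print(sample_with_temperature([0.0, 100.0], rng=rng))
print(len(crop_to_context(list(range(2000)))))
```

Lower temperatures sharpen the distribution toward the highest-scoring token, and higher ones flatten it.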
Known limitations
This is a small from-scratch model on a tight budget. Expect:
- Coherent Deseret, sometimes off-topic answers. The language is fluent but instruction-following is weak. E.g., asked "who are you?" the model may answer about music players. It learned the chat format well; it learned what to say less well.
- No factual knowledge guarantees at this scale. A 200M-parameter model trained on ~10.5B tokens has limited capacity for facts.
- Modern American English phonology, not 1860s New England. Words like "ask" are rendered /æsk/ (modern) not /ɑsk/ (period). The translator is internally consistent but stylistically modern.
- No safety tuning, no RLHF/DPO. Single SFT pass over UltraChat-200k only.
- Limited multi-turn coherence beyond ~3 turns.
Reproduction
Full code, including the translator, training scripts, and validation harness, lives in the project repo (linked from the citation below). Total compute budget: under $50 on a rented H100. Time: ~24 hours wall-clock, including data prep on a single Mac.
Citation
DeseretLM-200M: a small language model trained from scratch in the Deseret Alphabet.
Christopher Patty (chrisjpatty), 2026.
If you build on the datasets, please also cite the source corpora; see the dataset cards for fineweb-edu-deseret and ultrachat-deseret.