Rodan-10M-Chat

A 10.41M-parameter chat model trained on one Apple M2 with MLX. It starts from the leaner Rodan-10M v9 base (PLE-free) and folds conversational data into a short continued-pretrain stage, which adds chat ability without throwing away what the base already knew.

Part of the Rodan-10M series. The base model (Rodan-10M-Base, v6) lives in its own repo; this is the chat stage built on the v9 checkpoint.

What it is

A ~10M chat model. With the right decoding settings it answers simple instructions coherently and stays on topic, which is more than I expected from something this small. It is not a production assistant. Its world knowledge is thin, its arithmetic is shaky, and it will make things up. Treat it as a demo of how much chat ability you can fit in 10M params, not as something to rely on.

USER:  What is the capital of France?
RODAN: The capital of France is Paris. It's a city known as the country, which has its own unique
       and cultural significance in Europe.

USER:  Give me a tip for staying healthy.
RODAN: You can use a regular exercise routine to manage your body and improve overall well-being.
       It's important ... maintaining healthy weight management by burning calories.

USER:  Why do we sleep?
RODAN: Sleep can be used to improve cognitive function and memory.

How it was built

At 10M params, ordinary supervised finetuning costs you base capability. We saw it in earlier Rodan experiments: a masked-SFT stage dropped the board average from about 34 to 32, which is what we call the SFT tax. So instead of running a separate masked-SFT stage, Rodan-Chat folds the instruction data into a continued-pretrain run in which 45% of the tokens are replay of the base's own domains (the approach Falcon used). The replay is what keeps the model from forgetting: chat ability gets added while commonsense, science, and arithmetic stay roughly where they were.

Warm-start: Rodan-10M v9 (PLE-free, 10.41M). The tied embedding grows 8192→8194 for 2 ChatML tokens.
Data (73M tokens): 40M tokens of smol-smoltalk conversations in ChatML plus 33M tokens of curated replay, trained with full-sequence LM loss (loss on every token, not only the assistant turns).
Optimizer: Muon on the 2D weights, AdamW elsewhere. Learning rates are low (AdamW 1.2e-3, Muon 7e-3, both below the base run), with a cosine schedule over 6000 steps.
Result: perplexity dropped 24.9 → 14.6, and the board avg held at 35.04 (v9 base: 35.70).
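The warm-start step above grows the tied embedding by two rows. A minimal sketch of what that can look like, using plain Python lists so it runs anywhere. The initialisation (mean of the existing rows plus small noise), the function names, and the id 8192 for <|im_start|> are assumptions for illustration; the actual run may have initialised the new rows differently.

```python
import random

OLD_VOCAB, NEW_VOCAB = 8192, 8194      # sizes quoted above
IM_START_ID, IM_END_ID = 8192, 8193    # 8193 is quoted in Usage; 8192 is inferred

def grow_tied_embedding(embed, n_new=2, noise_std=0.01, seed=0):
    """Return a copy of `embed` (a list of rows) with `n_new` rows appended.

    Assumption: new rows start at the mean of the existing rows plus small
    Gaussian noise, so the new tokens begin near the centre of the space.
    """
    rng = random.Random(seed)
    vocab, dim = len(embed), len(embed[0])
    mean = [sum(row[j] for row in embed) / vocab for j in range(dim)]
    new_rows = [[m + rng.gauss(0.0, noise_std) for m in mean] for _ in range(n_new)]
    return list(embed) + new_rows

def tied_logits(hidden, embed):
    """With tied embeddings the same matrix is the output projection:
    one logit per row, computed as a dot product with the hidden state."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embed]
```

Because the weights are tied, appending rows extends the input lookup and the output logits at once, so no separate output head has to be resized.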

Source	Share	Role
smol-smoltalk (ChatML)	55%	instruction / multi-turn chat
Cosmopedia (replay)	9%	commonsense anchor
dolmino pes2o + StackExchange (replay)	9%	knowledge anchor
synthetic arithmetic (replay)	9%	computation anchor
FineMath (replay)	9%	math anchor
science-QA (replay)	9%	science-MC anchor
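The shares in the table can be turned into a per-example source sampler and a rough token budget. This is only a sketch under the assumption that the mix is applied as sampling weights; the names MIXTURE, sample_sources, and token_budget are hypothetical, and the real pipeline may have pre-mixed the data in some other way.

```python
import random

# Shares copied from the table above; they should sum to 1.0.
MIXTURE = {
    "smol-smoltalk (ChatML)": 0.55,
    "Cosmopedia (replay)": 0.09,
    "dolmino pes2o + StackExchange (replay)": 0.09,
    "synthetic arithmetic (replay)": 0.09,
    "FineMath (replay)": 0.09,
    "science-QA (replay)": 0.09,
}

def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

def token_budget(total_tokens):
    """Split a total token count across sources according to the shares."""
    return {name: share * total_tokens for name, share in MIXTURE.items()}
```

Applied to the 73M-token total, token_budget gives about 40M chat tokens and about 33M replay tokens, which matches the data line above.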

Architecture

Same as the base: decoder-only, dim 320, 8 layers, 8 heads, MQA with 1 KV head, SwiGLU 768, RMSNorm, RoPE base 200k, QK-norm, tied embeddings, value-residual, LRM. No PLE, since the probe on the base showed it was dead. Vocab is 8194 (the 8k byte-BPE set plus <|im_start|> and <|im_end|>).
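As a sanity check on the size, the weight matrices implied by the numbers above can be counted directly. The sketch assumes no biases, a full dim×dim query projection, and that "SwiGLU 768" means three dim×768 projections (gate, up, down); count_params is a hypothetical helper, not code from the training repo.

```python
def count_params(dim=320, n_layers=8, n_heads=8, n_kv_heads=1, ffn=768, vocab=8194):
    """Count only the large weight matrices: attention, SwiGLU, tied embedding."""
    head_dim = dim // n_heads                   # 40
    kv_dim = n_kv_heads * head_dim              # MQA: a single shared K/V head
    attn = dim * dim + 2 * dim * kv_dim + dim * dim   # Wq, Wk, Wv, Wo
    mlp = 3 * dim * ffn                         # gate, up, down projections
    embed = vocab * dim                         # tied, so counted once
    return {
        "attn_per_layer": attn,
        "mlp_per_layer": mlp,
        "embed": embed,
        "total": n_layers * (attn + mlp) + embed,
    }

print(count_params())
```

Under these assumptions the matrices come to roughly 10.36M of the stated 10.41M. The remainder would have to come from pieces this sketch does not model, such as norm gains, value-residual, and LRM parameters.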

Evaluation

The base capability held; there was no SFT-tax collapse. Zero-shot lm-eval, limit 1000, ChatML-wrapped:

Task	Metric	Rodan-Chat	v9 base	Δ
HellaSwag	acc_norm	31.7	30.1	+1.6
ARC-Easy	acc_norm	35.3	35.4	≈
ARC-Challenge	acc_norm	22.4	22.2	≈
PIQA	acc	53.8	55.5	−1.7
ArithMark-2	acc	25.8	28.4	−2.6
Board avg (÷4)		35.04	35.70	−0.66

The 0.66 dip is partly just the ChatML wrapper hurting multiple-choice loglikelihood, and it's nowhere near the 34→32 drop a naive finetune would have caused. The replay did its job.

For instruction following itself, IFEval is close to useless at 10M: it grades strict constraint compliance, which really needs a model two or three orders of magnitude larger. So we measured the thing we actually care about instead. On 24 instruction prompts, an LLM judge compared Rodan-Chat against the v9 base, both decoded with the same repetition penalty. Chat won 14, tied 9, and lost 1, a 93% win-rate excluding ties (14 of the 15 decided comparisons). The base tended to lose by sliding into code or rambling, while Chat gave coherent on-topic answers, several of them correct (Paris, photosynthesis producing glucose, the opposite of hot being cold, sleep helping memory).

We skipped a full IFEval score on purpose. A 10M model fails its checks near-uniformly, so the number carries no signal and isn't worth the long generative eval. The win-rate above is the instruction-following metric we trust at this scale.
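For reference, the win-rate arithmetic is small enough to write out. The function names are hypothetical, and the half-credit variant is included only for comparison; the figure reported above is the one that excludes ties.

```python
def win_rate_excluding_ties(wins, losses):
    """Share of decided comparisons (wins + losses) that were wins."""
    decided = wins + losses
    if decided == 0:
        raise ValueError("no decided comparisons")
    return wins / decided

def win_rate_ties_as_half(wins, ties, losses):
    """Alternative convention: a tie counts as half a win for each side."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no comparisons")
    return (wins + 0.5 * ties) / total

# Judge results from the 24-prompt comparison above.
print(f"{win_rate_excluding_ties(14, 1):.1%}")
print(f"{win_rate_ties_as_half(14, 9, 1):.1%}")
```

With only 24 prompts, and 9 of them tied, either number is a coarse signal rather than a precise score.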

Usage

Wrap prompts in ChatML and decode with a repetition penalty. Tiny models loop badly under pure greedy decoding, and the penalty is the difference between gibberish and readable answers.

question = "What is the capital of France?"  # any user message
ctx = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
# decode greedily with repetition_penalty=1.3 and no repeated 3-grams; stop on <|im_end|> (id 8193) or <|endoftext|> (id 0)

The settings I'd recommend: greedy decoding (or sampling at a very low temperature), repetition_penalty=1.3, no_repeat_ngram=3, max_new≈70.
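The decoding recipe above can be sketched as a framework-agnostic loop. Everything here is illustrative: next_logits is a hypothetical callable that maps a list of token ids to a list of next-token logits (in practice it would wrap the MLX model), and the penalty uses the common convention of dividing positive logits and multiplying negative ones, applied to prompt and generated tokens alike, which is an assumption about how the original script works.

```python
import math

IM_END, EOT = 8193, 0  # stop ids quoted above

def build_prompt(question):
    """Wrap a single user turn in ChatML and open the assistant turn."""
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

def apply_repetition_penalty(logits, seen, penalty=1.3):
    """Make every already-seen token less likely (assumed convention)."""
    out = list(logits)
    for t in set(seen):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def banned_by_ngram(tokens, n=3):
    """Tokens that would complete an n-gram already present in `tokens` (n >= 2)."""
    prefix = tuple(tokens[-(n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

def greedy_decode(next_logits, prompt_ids, max_new=70, penalty=1.3,
                  no_repeat_ngram=3, stop_ids=(IM_END, EOT)):
    """Greedy decoding with repetition penalty and n-gram blocking."""
    tokens, out = list(prompt_ids), []
    for _ in range(max_new):
        logits = apply_repetition_penalty(next_logits(tokens), tokens, penalty)
        for t in banned_by_ngram(tokens, no_repeat_ngram):
            logits[t] = -math.inf
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt in stop_ids:
            break
        tokens.append(nxt)
        out.append(nxt)
    return out
```

To use it, tokenize build_prompt(question) with the model's tokenizer, pass the ids as prompt_ids, and decode the returned ids back to text. Setting penalty=1.0 with a very large no_repeat_ngram recovers pure greedy decoding, which is a quick way to reproduce the looping behaviour described above.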

Limitations

~10M params, English only, for research and teaching. Don't use it in production, for factual queries, or for advice.
Thin world knowledge, weak arithmetic, prone to making things up, near chance on abstract reasoning.
It needs a repetition penalty to stay coherent; pure greedy decoding loops.
No safety alignment. It imitates the shape of a chat reply without being a reliable assistant.