SSMoELM-it
Scratch Small MoE Language Model — Instruct — instruction-tuned version of SSMoELM-Base.
- 47M total / 25.8M active parameters (top-2 sparse routing)
- 12.1 MB packed weights (1-bit routed experts, 4-bit attention & embedding)
- Fine-tuned on Dolly-15k + oasst1 EN
Note: The HuggingFace model card may display ~12M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly. The actual model has 47M parameters quantized to 1-bit and 4-bit.
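The mismatch can be checked with back-of-the-envelope arithmetic. The sketch below assumes 1 MB = 10^6 bytes and that a viewer counts one 8-bit parameter per stored uint8 byte; both are assumptions for illustration, not documented behaviour of the model card viewer.

```python
# Figures taken from this card.
total_params = 47.04e6   # logical parameters
packed_bytes = 12.1e6    # assumption: 1 MB = 10**6 bytes

# Average storage cost per logical parameter, including any scales or metadata.
bits_per_param = packed_bytes * 8 / total_params

# Assumption: a viewer that treats each uint8 byte as one 8-bit parameter.
apparent_params_m = packed_bytes / 1e6

print(f"{bits_per_param:.2f} bits per parameter on average")
print(f"~{apparent_params_m:.0f}M apparent 8-bit parameters")
```

An average of roughly two bits per parameter is consistent with a mix of 1-bit and 4-bit tensors.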
"Scratch" carries two meanings: built for Scratch, trained from scratch.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer + Sparse MoE FFN |
| Total params | 47.04M |
| Active params | 25.80M (per forward pass) |
| d_model | 768 |
| Layers | 6 |
| Attention | GQA — 12 heads, kv_heads=3, head_dim=64 |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| MoE | 8 routed experts + 1 shared expert, top-2 routing |
| d_ff (per expert) | 256 |
| Vocabulary | 8,192 (BPE, byte-fallback, English-optimized) |
| Context length | 2,048 tokens |
| Base model | SSMoELM-Base (900M token pretrain) |
| Framework | MLX (training) / PyTorch (inference) |
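The total and active parameter counts can be reproduced from the hyperparameters in this table. The sketch below is an estimate under stated assumptions: no bias terms, a bias-free linear router per layer, two RMSNorm weights per layer plus one final norm, and an input embedding tied to the output head. None of these details is confirmed by this card.

```python
d_model, n_layers, vocab = 768, 6, 8192
n_heads, kv_heads, head_dim = 12, 3, 64
n_routed, n_shared, top_k, d_ff = 8, 1, 2, 256

# Attention: Q and O use all heads, K and V use the grouped kv_heads.
attn = 2 * d_model * n_heads * head_dim + 2 * d_model * kv_heads * head_dim
# One SwiGLU expert: gate, up and down projections.
expert = 3 * d_model * d_ff
# Assumption: the router is a single bias-free linear layer.
router = d_model * n_routed
# Assumption: two RMSNorm weights per layer plus one final norm.
norms = (2 * n_layers + 1) * d_model
# Assumption: the input embedding is tied with the output head.
embed = vocab * d_model

# Parameters used by every token regardless of routing.
dense = embed + norms + n_layers * (attn + router)
total = dense + n_layers * (n_routed + n_shared) * expert
active = dense + n_layers * (top_k + n_shared) * expert

print(f"total  = {total / 1e6:.2f}M")
print(f"active = {active / 1e6:.2f}M")
```

Under these assumptions the estimate rounds to 47.04M total and 25.80M active, matching the table.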
Quantization Scheme
Same as SSMoELM-Base. See SSMoELM-Base for details.
Training
Pretraining
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu-score-2 (60%) + FineWeb (40%) |
| Tokens | 900M |
Instruction Tuning (SFT)
| Setting | Value |
|---|---|
| Base checkpoint | SSMoELM-Base (step 013734) |
| Dataset | Dolly-15k (CC BY-SA 3.0) + oasst1 EN (Apache 2.0) |
| Samples | ~39K (14.8K Dolly + 24K oasst1) |
| Steps | 20,000 |
| Learning rate | 1e-5 (constant) |
| Loss | Assistant tokens only |
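"Assistant tokens only" means the prompt tokens are excluded from the loss. A minimal sketch of such a mask follows, using the token IDs listed under Special Tokens. Whether the closing `<|eot|>` counts as a target is an assumption here, and `assistant_loss_mask` is a hypothetical helper, not a function from the training code.

```python
ASSISTANT_ID, EOT_ID = 5, 6  # <|assistant|> and <|eot|>

def assistant_loss_mask(token_ids):
    """Return 1 for tokens that contribute to the loss, 0 otherwise."""
    mask, in_assistant = [], False
    for tok in token_ids:
        if tok == ASSISTANT_ID:
            in_assistant = True
            mask.append(0)   # the role marker itself is not a target
        elif in_assistant:
            mask.append(1)   # response tokens, including the closing <|eot|>
            if tok == EOT_ID:
                in_assistant = False
        else:
            mask.append(0)   # system and user tokens are ignored
    return mask

# <bos> <|user|> 100 101 <|eot|> <|assistant|> 200 201 <|eot|> <eos>
# (100, 101, 200, 201 are placeholder content IDs)
ids = [0, 4, 100, 101, 6, 5, 200, 201, 6, 1]
print(assistant_loss_mask(ids))  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```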
Benchmark Results (0-shot, 500 samples)
| Task | Shot | Metric | Samples | Random | Base | Instruct | Δ |
|---|---|---|---|---|---|---|---|
| HellaSwag | 0-shot | acc_norm | 500 | 25% | 33.4% | 33.2% | -0.2% |
| LAMBADA | 0-shot | acc | 500 | N/A | 13.8% | 14.8% | +1.0% |
| PIQA | 0-shot | acc_norm | 500 | 50% | 53.2% | 55.4% | +2.2% |
| WinoGrande | 0-shot | acc | 500 | 50% | 49.6% | 49.6% | 0.0% |
| ARC-Easy | 0-shot | acc_norm | 500 | 25% | 35.0% | 35.2% | +0.2% |
| ARC-Challenge | 0-shot | acc_norm | 500 | 25% | 21.0% | 24.0% | +3.0% |
| BoolQ | 0-shot | acc | 500 | 50% | 36.2% | 44.4% | +8.2% |
| MMLU (57 tasks avg) | 0-shot | acc | up to 500/task | 25% | 23.4% | 23.2% | -0.2% |
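With 500 samples per task, small differences can fall within sampling noise. As a rough guide only, the sketch below computes the binomial standard error of a single accuracy figure. It treats samples as independent and ignores that Base and Instruct were presumably scored on the same items, so it is not a significance test.

```python
import math

def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)

n = 500
# (task, Base accuracy, Instruct accuracy) taken from the table above.
for task, base, instruct in [("PIQA", 0.532, 0.554), ("BoolQ", 0.362, 0.444)]:
    se = standard_error(instruct, n)
    delta = instruct - base
    print(f"{task}: delta = {delta * 100:+.1f} pts, SE = {se * 100:.1f} pts")
```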
Expert Routing Statistics
Measured on 136 tokens (8 diverse text samples), top-2 routing. Uniform load = 12.5%.
| Layer | E0 | E1 | E2 | E3 | E4 | E5 | E6 | E7 | CV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9% | 7% | 9% | 14% | 18% | 21% | 11% | 10% | 0.35 |
| 1 | 7% | 12% | 12% | 13% | 17% | 10% | 16% | 12% | 0.23 |
| 2 | 9% | 19% | 15% | 14% | 9% | 15% | 12% | 7% | 0.29 |
| 3 | 12% | 6% | 13% | 9% | 13% | 9% | 17% | 20% | 0.34 |
| 4 | 14% | 10% | 13% | 12% | 12% | 21% | 8% | 10% | 0.31 |
| 5 | 18% | 14% | 7% | 14% | 6% | 11% | 24% | 7% | 0.47 |
CV = coefficient of variation (lower = more balanced). No expert collapse observed.
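The CV column can be approximated from a row of the table. The sketch below assumes the population standard deviation is used (this card does not say which variant was applied) and works from the rounded percentages, so the result can differ slightly from the reported value.

```python
import statistics

def coefficient_of_variation(loads):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(loads) / statistics.fmean(loads)

layer5 = [18, 14, 7, 14, 6, 11, 24, 7]  # rounded percentages, layer 5
uniform = [12.5] * 8                    # perfectly balanced load

print(round(coefficient_of_variation(layer5), 2))
print(coefficient_of_variation(uniform))
```

A perfectly uniform load gives a CV of 0, which is the reference point for "lower = more balanced".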
Tokenizer
- BPE, vocabulary size = 8,192
- Byte fallback enabled (no <unk>)
- ASCII/English-optimized segmentation
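Byte fallback means that text not covered by a learned merge is spelled out as its UTF-8 bytes instead of being mapped to an unknown token. The sketch below illustrates the idea only. The `<0xNN>` token naming is a common convention and is assumed here; the names in this repo's tokenizer.json may differ.

```python
def byte_fallback_tokens(text):
    """Spell a string as one byte-level token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

# A non-ASCII character becomes two byte tokens rather than <unk>.
print(byte_fallback_tokens("é"))  # ['<0xC3>', '<0xA9>']
```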
Special Tokens
| Token | ID | Role |
|---|---|---|
| `<bos>` | 0 | sequence start |
| `<eos>` | 1 | end of sequence |
| `<pad>` | 2 | padding |
| `<\|system\|>` | 3 | system turn |
| `<\|user\|>` | 4 | user turn |
| `<\|assistant\|>` | 5 | assistant turn |
| `<\|eot\|>` | 6 | end of turn |
Chat Template
<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
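The template can be rendered as text with a small helper. This is an illustrative sketch: `render_prompt` is a hypothetical function, and the newline placement is read off the template as printed above. For real use, prefer `build_chat_prompt` from inference.py, which works on token IDs.

```python
def render_prompt(user, response=None):
    """Render one turn of the chat template as text (hypothetical helper)."""
    prompt = f"<bos><|user|>\n{user}<|eot|>\n<|assistant|>\n"
    if response is None:
        return prompt  # inference: the model continues from here
    return prompt + f"{response}<|eot|><eos>"  # training: full example

print(render_prompt("What is photosynthesis?"))
```

Leaving `response` unset stops the string right after the assistant marker, which is the point where generation starts.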
Usage
Download inference.py and tokenizer.json from this repo. Requires: torch, safetensors, tokenizers.
pip install torch safetensors tokenizers
CLI (interactive chat):
python inference.py --ckpt model.safetensors
Recommended decoding defaults (chosen from a small prompt sweep + manual inspection to reduce repetition and gibberish for this 47M model):
| Parameter | Value |
|---|---|
| temperature | 0.0 |
| top_k | 1 |
| top_p | 0.9 |
| repetition_penalty | 1.3 |
These are already the defaults in inference.py. For more varied but less reliable text, try --temperature 0.55 --top-k 20 --repetition-penalty 1.15.
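To show how these four parameters interact, here is a minimal next-token selection sketch in plain Python. It is an assumption-laden illustration, not the code in inference.py: the repetition penalty follows the common divide-positive, multiply-negative rule, and `next_token` is a hypothetical name.

```python
import math
import random

def next_token(logits, prev_ids, temperature=0.0, top_k=1, top_p=0.9,
               repetition_penalty=1.3, rng=None):
    scores = list(logits)
    # Repetition penalty (assumed rule): make already-seen tokens less likely.
    for i in set(prev_ids):
        if scores[i] > 0:
            scores[i] /= repetition_penalty
        else:
            scores[i] *= repetition_penalty
    # temperature 0 or top_k 1 reduces to greedy decoding; top_p is then unused.
    if temperature <= 0.0 or top_k == 1:
        return max(range(len(scores)), key=scores.__getitem__)
    # Top-k: keep the k highest-scoring candidates (top_k <= 0 keeps all).
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Temperature-scaled softmax over the surviving candidates.
    best = scores[order[0]]
    weights = [math.exp((scores[i] - best) / temperature) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_w, cum = [], [], 0.0
    for i, w in zip(order, weights):
        kept_ids.append(i)
        kept_w.append(w)
        cum += w / total
        if cum >= top_p:
            break
    rng = rng or random.Random()
    return rng.choices(kept_ids, weights=kept_w, k=1)[0]

logits = [2.0, 1.9, 0.1]
print(next_token(logits, prev_ids=[]))   # greedy pick
print(next_token(logits, prev_ids=[0]))  # token 0 penalised, so token 1 wins
```

With the recommended defaults the choice is deterministic, and only the repetition penalty changes which token is picked.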
Single-shot:
python inference.py --ckpt model.safetensors --no-chat --prompt "Hello" --max-tokens 100 \
--temperature 0.0 --top-k 1 --repetition-penalty 1.3
Python API:
from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

# Load the packed weights and the tokenizer shipped with this repo.
model = load_packed_model("model.safetensors")
tok = Tokenizer.from_file("tokenizer.json")

# Build a single-turn prompt with the chat template, then decode greedily.
ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))
Memory: Weights stay in packed uint8 format (12.1 MB). Peak RAM ~18 MB during inference.
License
Apache 2.0