SSMoELM-it

Scratch Small MoE Language Model, Instruct: the instruction-tuned version of SSMoELM-Base.

47M total / 25.8M active parameters (top-2 sparse routing)
12.1 MB packed weights (1-bit routed experts, 4-bit attention & embedding)
Fine-tuned on Dolly-15k + oasst1 EN

Note: The HuggingFace model card may display ~12M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly. The actual model has 47M parameters quantized to 1-bit and 4-bit.
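
The undercount follows from bit-packing: several low-bit weights share one stored element, so a tool that counts stored elements reports fewer parameters than the model has. The sketch below only illustrates the idea. `pack_1bit` and `pack_4bit` are hypothetical helpers, and the bit order (least significant first) is an assumption, not necessarily the layout inference.py reads.

```python
def pack_1bit(bits):
    """Pack 0/1 values into bytes, 8 weights per byte (LSB first, assumed)."""
    assert len(bits) % 8 == 0
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for j, b in enumerate(bits[i:i + 8]):
            byte |= (b & 1) << j
        out.append(byte)
    return bytes(out)


def pack_4bit(values):
    """Pack 0..15 values into bytes, 2 weights per byte (low nibble first, assumed)."""
    assert len(values) % 2 == 0
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))


one_bit = pack_1bit([1, 0, 1, 1, 0, 0, 0, 1] * 4)  # 32 weights
four_bit = pack_4bit([1, 15, 7, 0] * 4)            # 16 weights
print(len(one_bit), len(four_bit))                 # stored elements: 4 8
```

Here 32 one-bit weights occupy 4 stored elements and 16 four-bit weights occupy 8, which is the kind of shrinkage that makes a packed file look smaller than the model it encodes.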

"Scratch" carries two meanings: built for Scratch, trained from scratch.

Model Details


Architecture	Decoder-only Transformer + Sparse MoE FFN
Total params	47.04M
Active params	25.80M (per forward pass)
d_model	768
Layers	6
Attention	GQA — 12 heads, kv_heads=3, head_dim=64
Positional encoding	RoPE
Normalization	RMSNorm
Activation	SwiGLU
MoE	8 routed experts + 1 shared expert, top-2 routing
d_ff (per expert)	256
Vocabulary	8,192 (BPE, byte-fallback, English-optimized)
Context length	2,048 tokens
Base model	SSMoELM-Base (900M-token pretraining)
Framework	MLX (training) / PyTorch (inference)
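
As a sanity check, the total and active parameter counts can be rebuilt from the hyperparameters in the table. The sketch below assumes tied input/output embeddings, no bias terms, two RMSNorms per layer plus one final RMSNorm, and a shared expert of the same size as a routed expert. None of these is stated in the card.

```python
# Hyperparameters from the Model Details table.
d_model, n_layers, vocab = 768, 6, 8192
n_heads, n_kv_heads, head_dim = 12, 3, 64
n_routed, n_shared, top_k, d_ff = 8, 1, 2, 256

# Assumptions (not stated in the card): tied embeddings, no biases,
# two RMSNorms per layer plus one final RMSNorm.
attn = d_model * head_dim * (2 * n_heads + 2 * n_kv_heads)  # Q, O, K, V projections
expert = 3 * d_model * d_ff                                 # SwiGLU: gate, up, down
router = d_model * n_routed                                 # one logit per routed expert
norms = 2 * d_model                                         # two RMSNorm scales per layer


def count(experts_per_layer):
    """Parameter count when `experts_per_layer` experts are counted in each layer."""
    per_layer = attn + experts_per_layer * expert + router + norms
    return vocab * d_model + n_layers * per_layer + d_model  # + final RMSNorm


total = count(n_routed + n_shared)   # all 9 experts
active = count(top_k + n_shared)     # 2 routed + 1 shared per token
print(f"total  = {total:,}")   # total  = 47,036,160
print(f"active = {active:,}")  # active = 25,802,496
```

Under these assumptions the counts round to the 47.04M and 25.80M listed above, which suggests, but does not confirm, that the assumptions match the real model.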

Quantization Scheme

Same as SSMoELM-Base. See SSMoELM-Base for details.

Training

Pretraining


Dataset	FineWeb-Edu-score-2 (60%) + FineWeb (40%)
Tokens	900M

Instruction Tuning (SFT)


Base checkpoint	SSMoELM-Base (step 013734)
Dataset	Dolly-15k (CC BY-SA 3.0) + oasst1 EN (Apache 2.0)
Samples	~39K (14.8K Dolly + 24K oasst1)
Steps	20,000
Learning rate	1e-5 (constant)
Loss	Assistant tokens only
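
"Assistant tokens only" means that prompt tokens are masked out of the loss. A minimal sketch of such a mask follows, using the `<|assistant|>` and `<|eot|>` IDs from the Special Tokens section of this card. Whether the closing `<|eot|>` is included in the loss is an assumption, and `assistant_loss_mask` is a hypothetical helper, not a function from the training code.

```python
ASSISTANT_ID, EOT_ID = 5, 6  # IDs from the Special Tokens table


def assistant_loss_mask(ids):
    """Return 1 for tokens that contribute to the loss, 0 otherwise.

    Assumption: the response tokens and the closing <|eot|> are trained on;
    the <|assistant|> marker and everything outside the assistant turn are not.
    """
    mask, in_assistant = [], False
    for tok in ids:
        if tok == ASSISTANT_ID:
            in_assistant = True
            mask.append(0)
        elif in_assistant:
            mask.append(1)
            if tok == EOT_ID:
                in_assistant = False
        else:
            mask.append(0)
    return mask


# <bos> <|user|> 100 101 <|eot|> <|assistant|> 200 201 <|eot|> <eos>
# (100, 101, 200, 201 are placeholder text-token IDs)
ids = [0, 4, 100, 101, 6, 5, 200, 201, 6, 1]
print(assistant_loss_mask(ids))  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```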

Benchmark Results (0-shot, 500 samples)

Task	Shot	Metric	Samples	Random	Base	Instruct	Δ (pp)
HellaSwag	0-shot	acc_norm	500	25%	33.4%	33.2%	-0.2
LAMBADA	0-shot	acc	500	N/A	13.8%	14.8%	+1.0
PIQA	0-shot	acc_norm	500	50%	53.2%	55.4%	+2.2
WinoGrande	0-shot	acc	500	50%	49.6%	49.6%	0.0
ARC-Easy	0-shot	acc_norm	500	25%	35.0%	35.2%	+0.2
ARC-Challenge	0-shot	acc_norm	500	25%	21.0%	24.0%	+3.0
BoolQ	0-shot	acc	500	50%	36.2%	44.4%	+8.2
MMLU (57 tasks avg)	0-shot	acc	up to 500/task	25%	23.4%	23.2%	-0.2
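
For context on how much sampling noise a 500-sample evaluation carries, the binomial standard error of each accuracy can be estimated as below. This is a rough sketch, not part of the original evaluation: `stderr_pp` is a hypothetical helper, and combining the two errors as if the Base and Instruct runs were independent is a simplification, since both were presumably scored on the same samples.

```python
import math


def stderr_pp(acc, n=500):
    """Binomial standard error of an accuracy, in percentage points."""
    return 100 * math.sqrt(acc * (1 - acc) / n)


for task, base, inst in [("PIQA", 0.532, 0.554), ("BoolQ", 0.362, 0.444)]:
    delta = 100 * (inst - base)
    # Rough SE of the difference, treating the two runs as independent;
    # they share the same samples, so this is only a loose guide.
    se_diff = math.hypot(stderr_pp(base), stderr_pp(inst))
    print(f"{task}: delta = {delta:+.1f} pp, SE of difference ~ {se_diff:.1f} pp")
```

At n = 500 each accuracy has a standard error of roughly 2 percentage points, so Base-to-Instruct differences of a few points may be within sampling noise under this approximation.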

Expert Routing Statistics

Measured on 136 tokens (8 diverse text samples), top-2 routing. Uniform load = 12.5%.

Layer	E0	E1	E2	E3	E4	E5	E6	E7	CV
0	9%	7%	9%	14%	18%	21%	11%	10%	0.35
1	7%	12%	12%	13%	17%	10%	16%	12%	0.23
2	9%	19%	15%	14%	9%	15%	12%	7%	0.29
3	12%	6%	13%	9%	13%	9%	17%	20%	0.34
4	14%	10%	13%	12%	12%	21%	8%	10%	0.31
5	18%	14%	7%	14%	6%	11%	24%	7%	0.47

CV = coefficient of variation (lower = more balanced). No expert collapse observed.
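
The CV column can be approximately recomputed from the rounded loads in the table. The sketch below assumes the population standard deviation; the card does not say which estimator was used, and because the table percentages are rounded, the results land close to the reported values without matching them exactly.

```python
from statistics import mean, pstdev

# Rounded per-expert load (%) for layers 0 and 5, copied from the table above.
loads = {
    0: [9, 7, 9, 14, 18, 21, 11, 10],
    5: [18, 14, 7, 14, 6, 11, 24, 7],
}


def coefficient_of_variation(xs):
    """Population standard deviation divided by the mean (estimator is assumed)."""
    return pstdev(xs) / mean(xs)


uniform = 100 / 8  # 12.5% per expert under perfectly balanced routing
for layer, xs in loads.items():
    print(f"layer {layer}: CV = {coefficient_of_variation(xs):.2f}, "
          f"max load = {max(xs) / uniform:.2f}x uniform")
```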

Tokenizer

BPE, vocabulary size = 8,192
Byte fallback enabled (no <unk>)
ASCII/English-optimized segmentation

Special Tokens

Token	ID	Role
`<bos>`	0	sequence start
`<eos>`	1	end of sequence
`<pad>`	2	padding
`<|system|>`	3	system turn
`<|user|>`	4	user turn
`<|assistant|>`	5	assistant turn
`<|eot|>`	6	end of turn

Chat Template

<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
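
A minimal sketch of rendering this template as a string is shown below. `render_chat` is a hypothetical helper for illustration, and it assumes the line breaks in the template are literal newline characters. For real use, prefer `build_chat_prompt` from inference.py, which may insert special tokens by ID and not as text.

```python
def render_chat(user, response=None):
    """Render one turn in the chat template.

    With response=None the string stops after the assistant marker,
    which is the form used to prompt the model at inference time.
    """
    prompt = f"<bos><|user|>\n{user}<|eot|>\n<|assistant|>\n"
    if response is None:
        return prompt
    return prompt + f"{response}<|eot|><eos>"


print(render_chat("What is photosynthesis?"))  # inference-time prompt
print(render_chat("Hi", "Hello!"))             # full training-style example
```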

Usage

Download inference.py, tokenizer.json, and model.safetensors from this repo. Requires: torch, safetensors, tokenizers.

pip install torch safetensors tokenizers

CLI (interactive chat):

python inference.py --ckpt model.safetensors

Recommended decoding defaults (chosen from a small prompt sweep and manual inspection to reduce repetition and gibberish for this 47M model):

Parameter	Value
`temperature`	`0.0`
`top_k`	`1`
`top_p`	`0.9`
`repetition_penalty`	`1.3`

These are already the defaults in inference.py. With top_k = 1, decoding is greedy, so top_p has no effect at these settings. For more varied but less reliable text, try --temperature 0.55 --top-k 20 --repetition-penalty 1.15.

Single-shot:

python inference.py --ckpt model.safetensors --no-chat --prompt "Hello" --max-tokens 100 \
  --temperature 0.0 --top-k 1 --repetition-penalty 1.3

From Python:

from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")
tok   = Tokenizer.from_file("tokenizer.json")

ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))

Memory: Weights stay in packed uint8 format (12.1 MB). Peak RAM ~18 MB during inference.

License

Apache 2.0

