SSMoELM-it

SSMoELM-it (Scratch Small MoE Language Model, Instruct) is the instruction-tuned version of SSMoELM-Base.

  • 47M total / 25.8M active parameters (top-2 sparse routing)
  • 12.1 MB packed weights (1-bit routed experts, 4-bit attention & embedding)
  • Fine-tuned on Dolly-15k + oasst1 EN

Note: The HuggingFace model card may display ~12M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly. The actual model has 47M parameters quantized to 1-bit and 4-bit.

"Scratch" carries two meanings: built for Scratch, trained from scratch.


Model Details

Architecture         Decoder-only Transformer + Sparse MoE FFN
Total params         47.04M
Active params        25.80M (per token)
d_model              768
Layers               6
Attention            GQA: 12 heads, kv_heads=3, head_dim=64
Positional encoding  RoPE
Normalization        RMSNorm
Activation           SwiGLU
MoE                  8 routed experts + 1 shared expert, top-2 routing
d_ff (per expert)    256
Vocabulary           8,192 (BPE, byte-fallback, English-optimized)
Context length       2,048 tokens
Base model           SSMoELM-Base (pretrained on 900M tokens)
Framework            MLX (training) / PyTorch (inference)
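
The two parameter totals can be reproduced from the dimensions listed above. The sketch below is a back-of-the-envelope count, not the author's script. It assumes no bias terms, tied input/output embeddings, a shared expert with the same d_ff as the routed experts, two RMSNorm weight vectors per layer plus a final one, and a bias-free linear router. None of these details are stated in the card, but under them the count rounds to 47.04M total and 25.80M active.

```python
# Back-of-the-envelope parameter count from the Model Details table.
# Assumptions (not stated in the card): no biases, tied embeddings,
# shared expert uses the same d_ff as routed experts, 2 RMSNorms per layer
# plus one final RMSNorm, and a bias-free linear router.
D_MODEL, LAYERS, VOCAB = 768, 6, 8192
HEADS, KV_HEADS, HEAD_DIM = 12, 3, 64
ROUTED, SHARED, TOP_K, D_FF = 8, 1, 2, 256


def attention_params():
    q_and_o = 2 * D_MODEL * HEADS * HEAD_DIM     # W_q and W_o
    k_and_v = 2 * D_MODEL * KV_HEADS * HEAD_DIM  # W_k and W_v (GQA: fewer KV heads)
    return q_and_o + k_and_v


def expert_params():
    return 3 * D_MODEL * D_FF  # SwiGLU: gate, up, and down projections


def layer_params(experts_counted):
    router = D_MODEL * ROUTED
    norms = 2 * D_MODEL
    return attention_params() + experts_counted * expert_params() + router + norms


def model_params(experts_counted):
    embedding = VOCAB * D_MODEL
    final_norm = D_MODEL
    return LAYERS * layer_params(experts_counted) + embedding + final_norm


total_params = model_params(ROUTED + SHARED)   # every expert is stored
active_params = model_params(TOP_K + SHARED)   # experts touched per token

print(f"total:  {total_params / 1e6:.2f}M")
print(f"active: {active_params / 1e6:.2f}M")
```

The gap between the two figures comes entirely from the six routed experts per layer that a given token skips.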

Quantization Scheme

Same as SSMoELM-Base; see the SSMoELM-Base model card for details.


Training

Pretraining

Dataset  FineWeb-Edu-score-2 (60%) + FineWeb (40%)
Tokens   900M

Instruction Tuning (SFT)

Base checkpoint  SSMoELM-Base (step 013734)
Dataset          Dolly-15k (CC BY-SA 3.0) + oasst1 EN (Apache 2.0)
Samples          ~39K (14.8K Dolly + 24K oasst1)
Steps            20,000
Learning rate    1e-5 (constant)
Loss             Assistant tokens only
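
Restricting the loss to assistant tokens amounts to a per-token mask over each training sequence. Here is a minimal sketch, assuming the targets run from the token after <|assistant|> through the closing <|eot|>. The card does not say whether the training code also counts <|eot|> or <eos>. The function name and the example IDs above 6 are made up for illustration.

```python
ASSISTANT_ID = 5  # <|assistant|>, from the special-token table
EOT_ID = 6        # <|eot|>


def assistant_loss_mask(token_ids):
    """Return 1 for tokens that contribute to the loss and 0 for the rest."""
    mask = []
    in_assistant = False
    for tok in token_ids:
        if tok == ASSISTANT_ID:
            in_assistant = True
            mask.append(0)  # the role marker itself is not a target
        elif in_assistant:
            mask.append(1)
            if tok == EOT_ID:
                in_assistant = False  # <|eot|> is a target, then masking resumes
        else:
            mask.append(0)
    return mask


# <bos> <|user|> 100 101 <|eot|> <|assistant|> 200 201 <|eot|> <eos>
example_ids = [0, 4, 100, 101, 6, 5, 200, 201, 6, 1]
example_mask = assistant_loss_mask(example_ids)
print(example_mask)
```

In a training loop the mask would multiply the per-token loss before averaging, so user and system text shape the context without being learned as output.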

Benchmark Results (0-shot, 500 samples)

Task                 Shot    Metric    Samples         Random  Base   Instruct  Δ (pp)
HellaSwag            0-shot  acc_norm  500             25%     33.4%  33.2%     -0.2
LAMBADA              0-shot  acc       500             N/A     13.8%  14.8%     +1.0
PIQA                 0-shot  acc_norm  500             50%     53.2%  55.4%     +2.2
WinoGrande           0-shot  acc       500             50%     49.6%  49.6%      0.0
ARC-Easy             0-shot  acc_norm  500             25%     35.0%  35.2%     +0.2
ARC-Challenge        0-shot  acc_norm  500             25%     21.0%  24.0%     +3.0
BoolQ                0-shot  acc       500             50%     36.2%  44.4%     +8.2
MMLU (57 tasks avg)  0-shot  acc       up to 500/task  25%     23.4%  23.2%     -0.2
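
At 500 samples per task, each accuracy carries sampling noise of roughly two percentage points (one standard error), so small differences between Base and Instruct may not be meaningful. The sketch below uses the normal approximation to a binomial proportion to put an interval around one table entry. The function name is hypothetical and this is not how the card's numbers were produced.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n items."""
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error


# PIQA, Instruct column: 55.4% on 500 samples.
low, high = accuracy_interval(0.554, 500)
print(f"PIQA Instruct: 55.4% (95% interval {low:.1%} to {high:.1%})")
```

Because both models are scored on the same items, a paired test would be the tighter comparison. The interval here is only a quick sanity check.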

Expert Routing Statistics

Measured on 136 tokens (8 diverse text samples) with top-2 routing. Each row gives every expert's share of that layer's routing assignments; a uniform load would be 12.5% per expert.

Layer  E0   E1   E2   E3   E4   E5   E6   E7   CV
0      9%   7%   9%   14%  18%  21%  11%  10%  0.35
1      7%   12%  12%  13%  17%  10%  16%  12%  0.23
2      9%   19%  15%  14%  9%   15%  12%  7%   0.29
3      12%  6%   13%  9%   13%  9%   17%  20%  0.34
4      14%  10%  13%  12%  12%  21%  8%   10%  0.31
5      18%  14%  7%   14%  6%   11%  24%  7%   0.47

CV = coefficient of variation (lower = more balanced). No expert collapse observed.
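
The CV column can be approximated from the rounded percentages in the table. This sketch assumes CV is the population standard deviation divided by the mean; the card does not say which estimator was used. Because the inputs are rounded to whole percents, the results come out close to the reported values but not identical.

```python
from statistics import mean, pstdev

# Rounded per-expert load (%) copied from the routing table, one row per layer.
LOAD = {
    0: [9, 7, 9, 14, 18, 21, 11, 10],
    1: [7, 12, 12, 13, 17, 10, 16, 12],
    2: [9, 19, 15, 14, 9, 15, 12, 7],
    3: [12, 6, 13, 9, 13, 9, 17, 20],
    4: [14, 10, 13, 12, 12, 21, 8, 10],
    5: [18, 14, 7, 14, 6, 11, 24, 7],
}


def coefficient_of_variation(loads):
    """Population standard deviation divided by the mean (0 = perfectly balanced)."""
    return pstdev(loads) / mean(loads)


cv_by_layer = {layer: coefficient_of_variation(loads) for layer, loads in LOAD.items()}
for layer, cv in cv_by_layer.items():
    print(f"layer {layer}: CV = {cv:.2f}")
```

A collapsed layer, where one or two experts take nearly all assignments, would show a CV well above 1, which is far from anything in the table.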


Tokenizer

  • BPE, vocabulary size = 8,192
  • Byte fallback enabled (no <unk>)
  • ASCII/English-optimized segmentation

Special Tokens

Token          ID  Role
<bos>          0   sequence start
<eos>          1   end of sequence
<pad>          2   padding
<|system|>     3   system turn
<|user|>       4   user turn
<|assistant|>  5   assistant turn
<|eot|>        6   end of turn

Chat Template

<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
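
As a string, the template can be rendered as in the sketch below. It assumes the line breaks shown in the template are literal newline characters and covers a single user turn only. The function names are hypothetical, and build_chat_prompt in inference.py remains the authoritative implementation, including how special tokens are mapped to IDs.

```python
def render_prompt(user_message):
    """Render a single-turn prompt that stops where the assistant reply begins."""
    return f"<bos><|user|>\n{user_message}<|eot|>\n<|assistant|>\n"


def render_training_example(user_message, response):
    """Render a complete single-turn example, including the closing tokens."""
    return render_prompt(user_message) + f"{response}<|eot|><eos>"


print(render_prompt("What is photosynthesis?"))
```

At inference time the model is expected to continue from the end of the rendered prompt and emit <|eot|> when its answer is complete.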

Usage

Download inference.py, tokenizer.json, and model.safetensors from this repo. Requires: torch, safetensors, tokenizers.

pip install torch safetensors tokenizers

CLI (interactive chat):

python inference.py --ckpt model.safetensors

Recommended decoding defaults, chosen from a small prompt sweep and manual inspection to reduce repetition and gibberish for this 47M model:

Parameter           Value
temperature         0.0
top_k               1
top_p               0.9
repetition_penalty  1.3

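These are already the defaults in inference.py. With temperature 0.0 and top_k 1, decoding is effectively greedy, so top_p has no practical effect at these settings. For more varied but less reliable text, try --temperature 0.55 --top-k 20 --repetition-penalty 1.15.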
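
To illustrate how the default settings interact, here is a minimal sketch. It assumes the common logit-scaling form of repetition penalty (divide positive logits by the penalty, multiply negative ones) followed by a plain argmax for temperature 0.0. inference.py may implement either step differently, and the function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make every already-generated token less likely; other logits are unchanged."""
    penalized = list(logits)
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] = penalized[tok] / penalty
        else:
            penalized[tok] = penalized[tok] * penalty
    return penalized


def greedy_next_token(logits, generated_ids, penalty=1.3):
    """Pick the highest-scoring token after the repetition penalty is applied."""
    penalized = apply_repetition_penalty(logits, generated_ids, penalty)
    return max(range(len(penalized)), key=penalized.__getitem__)


toy_logits = [2.0, 1.8, -1.0]  # a made-up 3-token vocabulary
print(greedy_next_token(toy_logits, generated_ids=[]))   # token 0 has the top logit
print(greedy_next_token(toy_logits, generated_ids=[0]))  # token 0 is penalized below token 1
```

With a penalty of 1.3, a token that was just emitted needs a clear lead over the runner-up to be chosen again, which is one way a small model can be steered away from loops.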

Single-shot:

python inference.py --ckpt model.safetensors --no-chat --prompt "Hello" --max-tokens 100 \
  --temperature 0.0 --top-k 1 --repetition-penalty 1.3

Python API:

from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")
tok   = Tokenizer.from_file("tokenizer.json")

ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))

Memory: Weights stay in packed uint8 format (12.1 MB). Peak RAM ~18 MB during inference.


License

Apache 2.0
