BabyLM 2026 — MultiLingual track baseline (byte-premium-uniform)

A 110M-param Llama-style decoder pre-trained from scratch on the BabyBabelLM trilingual corpus (English, Dutch, Chinese), under the BabyLM 2026 MultiLingual track rules: 100M reference tokens, byte-premium adjusted, ≤10 epochs.

This is the baseline zero-point of our ablation grid. Subsequent runs vary the mixture allocation (loss-weighted, simultaneous-bilingual, typological-bridge curriculum, register-controlled) on top of an identical scaffold. The matching ablation paper is in preparation.

Architecture

  • Llama (HF LlamaForCausalLM) — RoPE, RMSNorm, SwiGLU, no biases, tied embeddings
  • 12 layers · 768 hidden · 12 heads · 2048 FFN
  • 1024 sequence length
  • 110,119,680 parameters
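
The parameter total can be checked by hand from the dimensions above. The sketch below assumes the 32,768-entry vocabulary from the Tokenizer section, standard multi-head attention (12 key/value heads, no grouped-query sharing), and one RMSNorm weight vector per norm; RoPE adds no learned parameters. The variable names are illustrative only.

```python
# Parameter count for a Llama-style decoder with tied embeddings and no biases.
VOCAB = 32_768   # joint BPE vocabulary (from the Tokenizer section)
HIDDEN = 768
LAYERS = 12
FFN = 2_048

embedding = VOCAB * HIDDEN        # counted once: the output head is tied to it
attention = 4 * HIDDEN * HIDDEN   # q, k, v, o projections (assumes no grouped-query attention)
mlp = 3 * HIDDEN * FFN            # SwiGLU: gate, up, and down projections
norms = 2 * HIDDEN                # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm
print(f"{total:,}")  # 110,119,680
```

About 25.2M of the total sits in the shared embedding matrix; the remaining 85M is in the 12 transformer blocks.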

Tokenizer

Joint byte-level BPE with a 32,768-token vocabulary, trained on a balanced sample of 50M characters from each of EN/NL/ZH. The same tokenizer is shared across all three languages (see the data card for why a joint tokenizer is required: the ZH data is 6.8% Latin script).

Training

  • Data: BabyLM-community/babylm-eng + babylm-nld + babylm-zho (BabyBabelLM 2026 100M tier). Full corpora are loaded into memory and shuffled globally: the Hub layout is category-clustered, so streaming with a reasonably sized shuffle buffer produces a biased sample.
  • Mixture: byte-premium-uniform — equal share of reference tokens per language (1/3 each), achieved by deficit-driven selection, not uniform doc sampling (mean doc sizes differ across languages).
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight decay 0.1), peak lr 6e-4, cosine decay to 10% of peak, 100-step warmup
  • Compute: 4× NVIDIA A10G (23 GB), bf16, DDP, micro-batch 16 per GPU × grad-accum 2 (effective batch 4 × 16 × 2 = 128 sequences = 131,072 tokens/step)
  • Tokens consumed at this checkpoint: 100,000,000 byte-premium-adjusted reference tokens
  • Per-language epochs at this checkpoint: ≈1.0 each (within the BabyLM ≤10-epoch cap)
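
To illustrate what deficit-driven selection means for the mixture, here is a minimal sketch and not the training code itself. It assumes each document already carries a byte-premium-adjusted reference-token count, and it always draws the next document from the language that is furthest below its target share. The function name `deficit_driven_order`, its arguments, and the toy document sizes are all hypothetical.

```python
import random
from collections import defaultdict

def deficit_driven_order(docs_by_lang, budget, shares=None, seed=0):
    """Yield (lang, doc_tokens) until `budget` reference tokens have been drawn.

    docs_by_lang maps a language code to a list of per-document reference-token
    counts (hypothetical input; the byte-premium adjustment is assumed upstream).
    """
    rng = random.Random(seed)
    langs = sorted(docs_by_lang)
    shares = shares or {lang: 1 / len(langs) for lang in langs}
    # Shuffle each language once; walking past the end starts another epoch.
    pools = {lang: rng.sample(docs, len(docs)) for lang, docs in docs_by_lang.items()}
    cursor = defaultdict(int)
    consumed = defaultdict(int)
    total = 0
    while total < budget:
        # Deficit = tokens the language should have received so far minus what it got.
        lang = max(langs, key=lambda l: shares[l] * total - consumed[l])
        pool = pools[lang]
        size = pool[cursor[lang] % len(pool)]
        cursor[lang] += 1
        consumed[lang] += size
        total += size
        yield lang, size

# Toy corpora with very different mean document sizes (made-up numbers).
docs = {"eng": [400] * 1000, "nld": [250] * 1000, "zho": [90] * 1000}
counts = defaultdict(int)
for lang, size in deficit_driven_order(docs, budget=300_000):
    counts[lang] += size
drawn = sum(counts.values())
print({lang: round(counts[lang] / drawn, 4) for lang in sorted(counts)})
```

Sampling documents uniformly from these toy corpora would instead give each language a token share proportional to its mean document size, which is the imbalance the deficit rule avoids.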

Revisions

The chck_{N}M revisions match the BabyLM eval pipeline's fast-eval naming:

chck_1M, chck_2M, ..., chck_9M, chck_10M, chck_20M, ..., chck_90M, chck_100M

Use revision=chck_NM to load any milestone. The default (main) is chck_100M.
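
A minimal sketch of loading a milestone with the Hugging Face transformers library, assuming the repo id used in the evaluation commands in this card and that tokenizer files are present in every revision. The helper names `milestone_revisions` and `load_checkpoint` are illustrative, not part of any released code.

```python
def milestone_revisions():
    """Revision names: every 1M up to 10M, then every 10M up to 100M."""
    millions = list(range(1, 11)) + list(range(20, 101, 10))
    return [f"chck_{m}M" for m in millions]

def load_checkpoint(revision="chck_100M",
                    repo_id="Shamima/babylm-2026-multilingual-uniform-100M"):
    """Download and return (tokenizer, model) for one milestone revision."""
    # Imported lazily so listing revision names does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
    return tokenizer, model

print(milestone_revisions())
```

Calling `load_checkpoint("chck_10M")` would then fetch the 10M-token milestone; omitting the argument fetches the final checkpoint.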

How to evaluate

git clone https://github.com/babylm-org/babylm-eval
cd babylm-eval/multilingual
bash scripts/zeroshot_model.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M
bash scripts/zeroshot_model_fast_all.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M

Citation

@misc{babylm-2026-uniform,
  title  = {BabyLM 2026 MultiLingual baseline (byte-premium-uniform)},
  author = {Hossain, Shamima},
  year   = {2026},
  url    = {https://huggingface.co/Shamima/babylm-2026-multilingual-uniform-100M}
}

Companion repo with audit, scaffold, and ablation configs: https://github.com/silvererudite/bb-lm-challenge-sub
