mist-encoder-base-ng

A small (30.9M-parameter) modern encoder specialised for Nigerian languages — Hausa (ha), Yoruba (yo), Igbo (ig), and Nigerian Pidgin (pcm) — pretrained from scratch with a masked-language-modeling (MLM) objective using the unified olaverse/otk-bpe-50k (Naija) tokenizer.

It is a deliberate specialist: a compact base you attach task heads to (classification, NER, language-ID, sentence embeddings). It is not intended to compete on raw task accuracy with larger multilingual or African-language encoders — its value is efficiency, a low-fertility Nigerian tokenizer, explicit Pidgin support, 0% UNK, and a clean Apache-2.0 release.

TL;DR — what it is and isn't

Strong on sentence-level tasks (topic/sentiment classification) relative to its size.
Efficient: 30.9M parameters vs 126M (AfriBERTa) / 178M (mBERT) / 270M (XLM-R).
Tokenizer edge: lower fertility than general multilingual tokenizers on Nigerian text.
Limited on token-level tasks (NER): trails AfriBERTa by roughly 9–19 entity-F1 points and mBERT by roughly 5–15, depending on the language. This is structural (tokenizer fragmentation + model capacity), not a tuning artifact. See Limitations.

Intended use

Load the encoder body and attach a head:

from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained("olaverse/mist-encoder-base-ng")
enc = AutoModel.from_pretrained("olaverse/mist-encoder-base-ng")

Good fits: topic/sentiment/language-ID classification, sentence embeddings (contrastive fine-tuning), and on-device / low-resource deployment where a ~31M-parameter footprint matters. NER is supported but weaker than larger models (see Limitations).
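One way to attach a classification head is sketched below: masked mean pooling over the encoder's last hidden state, followed by a linear layer. The `mean_pool` helper, the `MeanPoolClassifier` class, the `build_classifier` function, and the choice of mean pooling are illustrative assumptions, not the authors' fine-tuning recipe.

```python
import torch
from torch import nn


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1)
    return summed / counts


class MeanPoolClassifier(nn.Module):
    """Hypothetical head: encoder body -> mean pooling -> linear layer."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = mean_pool(out.last_hidden_state, attention_mask)
        return self.head(pooled)  # (B, num_labels) logits


def build_classifier(num_labels: int, name: str = "olaverse/mist-encoder-base-ng"):
    """Download the encoder (needs network access) and wrap it with the head."""
    from transformers import AutoModel  # imported lazily; only needed here

    encoder = AutoModel.from_pretrained(name)
    return MeanPoolClassifier(encoder, encoder.config.hidden_size, num_labels)
```

Calling `build_classifier(num_labels=7)` would return a module whose head is randomly initialised, so it still needs fine-tuning on labelled data before its logits mean anything.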

Training data

All sources are commercial-friendly (attribution-only), consistent with the Apache-2.0 release:

Source	License	Role
FineWeb-2 (ha/yo/ig/pcm)	ODC-By	Web text
castorini/wura (Nigerian subset)	Apache-2.0	Audited mC4 + news
asr-nigerian-pidgin/nigerian-pidgin-1.0	CC-BY-4.0	Fresh Pidgin sentences

FineWeb-2 and WURA both descend from Common Crawl / mC4, so documents were cross-deduped. The corpus was language-balanced (abundant Hausa capped; scarce Igbo/Pidgin taken in full, with the smallest language lightly upsampled) and chunked into 254-token windows so all text was used rather than truncating each document. Final training corpus: ~480k chunks.
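The chunking step could look like the following sketch. The card states only the 254-token window size, so the `chunk_token_ids` name, the non-overlapping windows, and keeping the final short window are all assumptions.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], window: int = 254) -> List[List[int]]:
    """Split one tokenized document into consecutive windows of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


# A 600-token document yields windows of 254, 254, and 92 tokens,
# so no text is discarded the way it would be with truncation.
doc = list(range(600))
chunks = chunk_token_ids(doc)
print([len(c) for c in chunks])
```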

Training details

Objective: masked language modeling (15% masking), from random init.
Architecture: ModernBERT — hidden 384, 6 layers, 6 heads, FFN 1152, max positions 1024.
Tokenizer: olaverse/otk-bpe-50k unified Naija — byte-level BPE, ~50k vocab, 0% UNK, NFC diacritic preservation, code-mixed English support.
Schedule: 16 epochs (~60k steps), batch size 128, bf16, AdamW, cosine LR 1e-4, 500 warmup.
Result: final train MLM loss 2.06, held-out eval loss ~2.21. Eval loss decreased monotonically and plateaued — no overfitting. (In hindsight ~11 epochs would have reached ~95% of the quality; 16 was more than this corpus needed.)
Parameters: 30.9M total; the ~50k-token embedding table is roughly two-thirds of that, so the transformer itself is only ~11M.
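The figures above can be cross-checked with a little arithmetic. This sketch assumes the rounded values are exact (50,000 vocabulary entries, 480,000 chunks) and counts a single embedding table. The card specifies a cosine schedule with 500 warmup steps but not its exact shape, so the linear warmup and the decay to zero in `lr_at` are assumptions too.

```python
import math

# Figures from the card; exact vocabulary and chunk counts are assumed
# to equal their rounded values (~50k tokens, ~480k chunks).
VOCAB_SIZE = 50_000
HIDDEN = 384
TOTAL_PARAMS = 30_900_000
CHUNKS = 480_000
BATCH_SIZE = 128
EPOCHS = 16
PEAK_LR = 1e-4
WARMUP_STEPS = 500

embedding_params = VOCAB_SIZE * HIDDEN                  # 19,200,000
embedding_share = embedding_params / TOTAL_PARAMS       # about 0.62
non_embedding_params = TOTAL_PARAMS - embedding_params  # about 11.7M

steps_per_epoch = CHUNKS // BATCH_SIZE  # 3,750
total_steps = steps_per_epoch * EPOCHS  # 60,000


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the embedding table accounts for about 62% of the parameters, which matches the "roughly two-thirds" description, and the step count matches the ~60k quoted in the schedule.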

Evaluation

Three benchmarks across all four languages. The two task benchmarks compare against AfriBERTa (v2, 126M) and mBERT (178M); the fertility benchmark compares tokenizers. Results are reported in full, including where the model is weaker.

1. Tokenizer fertility (tokens/word — lower is better)

From the otk-bpe-50k unified-Naija benchmark (MasakhaNEWS):

Tokenizer	Hausa	Yoruba	Igbo	Pidgin
otk-bpe-50k (ours)	1.231	1.296	1.416	1.249
GPT-4o (o200k)	1.589	1.687	1.807	1.304
AfroXLMR	1.604	2.277	2.570	1.401

Lower fertility = more signal per token at a fixed sequence length. otk-bpe-50k has the lowest fertility of the three tokenizers on all four languages.
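Fertility as defined in the heading (tokens per word) can be computed as in the sketch below. It assumes words are whitespace-separated, which may differ from the word definition used in the benchmark; `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer stands in for a real one.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits each word into pieces of up to 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


# "ina" -> 1 piece, "kwana" -> 2, "lafiya" -> 2, so 5 tokens over 3 words.
score = fertility(["ina kwana lafiya"], toy_tokenize)
```

To measure a real tokenizer, `tok.tokenize` from the Intended use snippet could be passed in place of `toy_tokenize`.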

2. Topic classification — MasakhaNEWS (macro-F1, max_length 512)

Model	Params	Hausa	Yoruba	Igbo	Pidgin
mist-encoder-base-ng	30.9M	0.878	0.859	0.803	0.898
AfriBERTa	126M	0.924	0.921	0.914	0.991
mBERT	178M	0.806	0.886	0.805	0.967

Competitive at a fraction of the size: beats mBERT on Hausa (0.878 vs 0.806), ties it on Igbo, trails it on Yoruba and Pidgin, and trails AfriBERTa on all four languages.
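For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1, so rare topics count as much as frequent ones. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for the table; `macro_f1` is a hypothetical helper and the labels are made up.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Per-class F1: sports 2/3, politics 2/3, health 1, so macro-F1 is 7/9.
gold = ["sports", "sports", "politics", "health"]
pred = ["sports", "politics", "politics", "health"]
score = macro_f1(gold, pred)
```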

3. Named-entity recognition — MasakhaNER 2.0 (entity-F1, seqeval, max_length 512)

Model	Params	Hausa	Yoruba	Igbo	Pidgin
mist-encoder-base-ng	30.9M	0.656	0.779	0.804	0.729
AfriBERTa	126M	0.850	0.867	0.897	0.886
mBERT	178M	0.810	0.837	0.855	0.881

The model trails both baselines on NER in all four languages. This is its clearest weak spot; see Limitations.
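Entity-F1 scores a prediction as correct only when an entity's type and full span both match the gold annotation, which is why fragmented or partially tagged names are penalised heavily. The sketch below is a simplified re-implementation of that idea for BIO tags; the reported numbers come from seqeval, not from this code, and `bio_spans` and `entity_f1` are hypothetical helpers.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        # A stray I- tag with no open span is treated as the start of a span.
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans across all sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_spans(g_tags), bio_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# One of two gold entities is recovered: precision 1.0, recall 0.5, F1 2/3.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```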

Limitations

Token-level tasks (NER) are the weakness. The gap is roughly 9–19 entity-F1 points to AfriBERTa and 5–15 to mBERT, and it is structural, not a tuning artifact: it persists across seeds (std 0.005) and is unchanged by labeling all subwords vs first-subword-only. Two causes: (a) the unified 50k tokenizer fragments entity words more than language-specific tokenizers. On Hausa NER text, ~61% of entity words split into multiple subwords (vs ~21% for AfriBERTa), so per-token representations carry less whole-word meaning; (b) at 30.9M parameters the model has less capacity to reassemble meaning from fragments than a 126M model. Use a larger model if NER accuracy is critical.
Hausa NER is notably low (0.656). Fragmentation on the MasakhaNER Hausa corpus is high (~1.52 subwords/word, vs ~1.23 on the tokenizer's MasakhaNEWS benchmark), suggesting an orthography/domain mismatch worth investigating for a future version.
Nigerian Pidgin pretraining data is scarce. Clean, permissively-licensed Pidgin text is limited; the Pidgin slice was lightly upsampled. Treat Pidgin as supported but thinner than the other three.
Small model. Best for sentence-level understanding and efficient deployment, not as a drop-in for the strongest available African-language encoders on hard tasks.
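The two NER diagnostics mentioned in the first limitation, the share of entity words that split into several subwords and the choice between first-subword-only and all-subword labeling, can be sketched as follows. All names here are hypothetical, the toy tokenizer is a stand-in, and tokenizing each word in isolation is a simplification (a byte-level BPE tokenizer may split a word differently in context). Converting `B-` to `I-` on continuation pieces is one common convention, assumed here.

```python
from typing import Callable, List, Tuple


def fragmentation_rate(words: List[str], labels: List[str],
                       tokenize: Callable[[str], List[str]]) -> float:
    """Share of entity words (label != "O") that split into 2+ subwords."""
    entity_words = [w for w, lab in zip(words, labels) if lab != "O"]
    if not entity_words:
        return 0.0
    split = sum(len(tokenize(w)) > 1 for w in entity_words)
    return split / len(entity_words)


def align_labels(words: List[str], labels: List[str],
                 tokenize: Callable[[str], List[str]],
                 first_subword_only: bool = True,
                 ignore: str = "IGNORE") -> Tuple[List[str], List[str]]:
    """Expand word-level BIO labels to subword level."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for j, piece in enumerate(tokenize(word)):
            pieces.append(piece)
            if j == 0:
                piece_labels.append(label)
            elif first_subword_only:
                piece_labels.append(ignore)  # excluded from the loss
            else:
                # Continuation pieces stay inside the same entity.
                inside = "I-" + label[2:] if label.startswith("B-") else label
                piece_labels.append(inside)
    return pieces, piece_labels


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer: pieces of up to 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


words = ["Aisha", "ta", "je", "Kano"]
labels = ["B-PER", "O", "O", "B-LOC"]
rate = fragmentation_rate(words, labels, toy_tokenize)  # "Aisha" splits, "Kano" does not
first = align_labels(words, labels, toy_tokenize, first_subword_only=True)
every = align_labels(words, labels, toy_tokenize, first_subword_only=False)
```

In a PyTorch training loop the `ignore` marker would typically map to label id -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`.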

What would improve a v2

Evidence-backed, in priority order: (1) more model capacity (~50–80M) — NER is where the parameter gap bit hardest; (2) a less-fragmenting tokenizer for token tasks (larger vocab or per-language merge budgets); (3) more pretraining data, especially Pidgin and Hausa. Longer pretraining is not a lever — eval loss already plateaued by ~epoch 11.

License

Apache-2.0. Training data is attribution-only (ODC-By / Apache-2.0 / CC-BY-4.0); please retain attribution to the upstream datasets.

Acknowledgements & citations

Built with FineWeb-2, WURA (Oladipo et al., EMNLP 2023), and the Nigerian Pidgin ASR corpus. Evaluated on MasakhaNEWS and MasakhaNER 2.0 (Adelani et al., Masakhane). AfriBERTa (Ogueji et al., 2021) and mBERT (Devlin et al., 2019) were used as comparison baselines.
