mist-encoder-base-ng
A small (30.9M-parameter) modern encoder specialised for Nigerian languages —
Hausa (ha), Yoruba (yo), Igbo (ig), and Nigerian Pidgin (pcm) — pretrained from
scratch with a masked-language-modeling (MLM) objective using the unified
olaverse/otk-bpe-50k (Naija) tokenizer.
It is a deliberate specialist: a compact base you attach task heads to (classification, NER, language-ID, sentence embeddings). It is not intended to compete on raw task accuracy with larger multilingual or African-language encoders — its value is efficiency, a low-fertility Nigerian tokenizer, explicit Pidgin support, 0% UNK, and a clean Apache-2.0 release.
TL;DR — what it is and isn't
- Strong on sentence-level tasks (topic/sentiment classification) relative to its size.
- Efficient: 30.9M parameters vs 126M (AfriBERTa) / 178M (mBERT) / 270M (XLM-R).
- Tokenizer edge: lower fertility than general multilingual tokenizers on Nigerian text.
- Limited on token-level tasks (NER): trails AfriBERTa by roughly 9–19 entity-F1 points and mBERT by roughly 5–15, depending on the language. This is structural (tokenizer fragmentation + model capacity), not a tuning artifact. See Limitations.
Intended use
Load the encoder body and attach a head:
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained("olaverse/mist-encoder-base-ng")
enc = AutoModel.from_pretrained("olaverse/mist-encoder-base-ng")
Good fits: topic/sentiment/language-ID classification, sentence embeddings (contrastive fine-tuning), and on-device / low-resource deployment where a ~31M-parameter footprint matters. NER is supported but weaker than larger models (see Limitations).
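As an illustration of "attach a head", here is a minimal sketch of a mean-pooled classification head in PyTorch. `masked_mean`, `MeanPoolClassifier`, and `build_classifier` are hypothetical names introduced for this example, and mean pooling is one reasonable choice rather than the method used for the reported results.

```python
import torch
from torch import nn


def masked_mean(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over non-padding positions.

    hidden: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens.
    """
    weights = mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * weights).sum(dim=1)
    counts = weights.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


class MeanPoolClassifier(nn.Module):
    """Hypothetical wrapper: encoder body + mean pooling + linear head."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = masked_mean(out.last_hidden_state, attention_mask)
        return self.head(pooled)  # (batch, num_labels) logits


def build_classifier(num_labels: int, name: str = "olaverse/mist-encoder-base-ng"):
    """Download tokenizer and encoder, then wrap the encoder with a fresh head."""
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(name)
    enc = AutoModel.from_pretrained(name)
    return tok, MeanPoolClassifier(enc, enc.config.hidden_size, num_labels)
```

A typical call would be `tok, clf = build_classifier(num_labels=7)`, followed by `batch = tok(texts, padding=True, truncation=True, return_tensors="pt")` and `clf(batch["input_ids"], batch["attention_mask"])`. The label count here is a placeholder, and the new head is randomly initialised, so it needs fine-tuning before use.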
Training data
All sources are commercial-friendly (attribution-only), consistent with the Apache-2.0 release:
| Source | License | Role |
|---|---|---|
| FineWeb-2 (ha/yo/ig/pcm) | ODC-By | Web text |
| castorini/wura (Nigerian subset) | Apache-2.0 | Audited mC4 + news |
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | CC-BY-4.0 | Fresh Pidgin sentences |
FineWeb-2 and WURA both descend from Common Crawl / mC4, so documents were cross-deduped. The corpus was language-balanced (abundant Hausa capped; scarce Igbo/Pidgin taken in full, with the smallest language lightly upsampled) and chunked into 254-token windows so all text was used rather than truncating each document. Final training corpus: ~480k chunks.
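The chunking step can be sketched as follows. This is a minimal illustration, assuming non-overlapping windows and that a short final window is kept; the card does not say how tails or special tokens were handled, and `chunk_token_ids` is a hypothetical helper name.

```python
from typing import Iterator, List


def chunk_token_ids(token_ids: List[int], window: int = 254) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of at most `window` token ids."""
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(0, len(token_ids), window):
        yield token_ids[start:start + window]


# A 600-token document becomes three chunks instead of one truncated sequence.
doc = list(range(600))
chunks = list(chunk_token_ids(doc))
print([len(c) for c in chunks])  # [254, 254, 92]
```

Compared with truncating each document at the window size, this keeps every token of a long document in the training set.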
Training details
- Objective: masked language modeling (15% masking), from random init.
- Architecture: ModernBERT — hidden 384, 6 layers, 6 heads, FFN 1152, max positions 1024.
- Tokenizer: olaverse/otk-bpe-50k (unified Naija): byte-level BPE, ~50k vocab, 0% UNK, NFC diacritic preservation, code-mixed English support.
- Schedule: 16 epochs (~60k steps), batch size 128, bf16, AdamW, cosine LR 1e-4, 500 warmup.
- Result: final train MLM loss 2.06, held-out eval loss ~2.21. Eval loss decreased monotonically and plateaued — no overfitting. (In hindsight ~11 epochs would have reached ~95% of the quality; 16 was more than this corpus needed.)
- Parameters: 30.9M total; the ~50k-token embedding table is roughly two-thirds of that, so the transformer itself is only ~11M.
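The step count and parameter split above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a vocabulary of exactly 50,000 entries (the card says "~50k"), a corpus of exactly 480,000 chunks (the card says "~480k"), and an embedding table that is counted once; the results are therefore approximate.

```python
# Figures taken from the card; the exact vocab and chunk counts are assumptions.
VOCAB_SIZE = 50_000       # "~50k vocab"
HIDDEN = 384              # hidden size
TOTAL_PARAMS = 30.9e6     # reported total
CHUNKS = 480_000          # "~480k chunks"
BATCH_SIZE = 128
EPOCHS = 16

# Embedding table: one HIDDEN-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN
embedding_share = embedding_params / TOTAL_PARAMS
body_params = TOTAL_PARAMS - embedding_params

# Optimizer steps: one step per batch, repeated for every epoch.
steps_per_epoch = CHUNKS // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

print(f"embedding: {embedding_params / 1e6:.1f}M ({embedding_share:.0%} of total)")
print(f"transformer body: {body_params / 1e6:.1f}M")
print(f"steps: {steps_per_epoch} per epoch, {total_steps} total")
```

Under these assumptions the embedding table comes to about 19.2M parameters (about 62% of the total), which leaves between 11M and 12M for the transformer layers and matches the stated ~60k steps.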
Evaluation
Three benchmarks covering all four languages. The two task benchmarks compare against AfriBERTa (v2, 126M) and mBERT (178M); the fertility benchmark compares against other tokenizers. Results are reported in full, including where the model is weaker.
1. Tokenizer fertility (tokens/word — lower is better)
From the otk-bpe-50k unified-Naija benchmark (MasakhaNEWS):
| Tokenizer | Hausa | Yoruba | Igbo | Pidgin |
|---|---|---|---|---|
| otk-bpe-50k (ours) | 1.231 | 1.296 | 1.416 | 1.249 |
| GPT-4o (o200k) | 1.589 | 1.687 | 1.807 | 1.304 |
| AfroXLMR | 1.604 | 2.277 | 2.570 | 1.401 |
Lower fertility = more signal per token at a fixed sequence length. The tokenizer beats general multilingual tokenizers on all four languages.
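Fertility as defined above (tokens per word) can be computed with a few lines of Python. This sketch assumes words are whitespace-delimited, which may differ from the benchmark's exact word definition; `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer only exists to make the example runnable offline.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return total tokens divided by total whitespace-delimited words."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: cut every word into pieces of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


print(fertility(["abcdef gh"], toy_tokenize))  # 1.5 (3 tokens / 2 words)
```

With a real tokenizer, the `tokenize` method of a loaded `AutoTokenizer` can be passed in place of `toy_tokenize`.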
2. Topic classification — MasakhaNEWS (macro-F1, max_length 512)
| Model | Params | Hausa | Yoruba | Igbo | Pidgin |
|---|---|---|---|---|---|
| mist-encoder-base-ng | 30.9M | 0.878 | 0.859 | 0.803 | 0.898 |
| AfriBERTa | 126M | 0.924 | 0.921 | 0.914 | 0.991 |
| mBERT | 178M | 0.806 | 0.886 | 0.805 | 0.967 |
Competitive at a fraction of the size: beats mBERT on Hausa, roughly ties it on Igbo, trails it on Yoruba and Pidgin, and trails AfriBERTa on all four languages.
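For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare topics count as much as frequent ones. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for the table; `macro_f1` is a hypothetical helper name.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("no labels to score")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["sports", "sports", "health", "health"]
pred = ["sports", "health", "health", "health"]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```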
3. Named-entity recognition — MasakhaNER 2.0 (entity-F1, seqeval, max_length 512)
| Model | Params | Hausa | Yoruba | Igbo | Pidgin |
|---|---|---|---|---|---|
| mist-encoder-base-ng | 30.9M | 0.656 | 0.779 | 0.804 | 0.729 |
| AfriBERTa | 126M | 0.850 | 0.867 | 0.897 | 0.886 |
| mBERT | 178M | 0.810 | 0.837 | 0.855 | 0.881 |
The model trails both baselines on NER. This is the honest weak spot — see below.
Limitations
- Token-level tasks (NER) are the weakness. The gap is roughly 9–19 entity-F1 points to AfriBERTa and 5–15 to mBERT, and it is structural, not a tuning artifact: it persists across seeds (std 0.005) and is unchanged by labeling all subwords vs first-subword-only. Two causes: (a) the unified 50k tokenizer fragments entity words more than language-specific tokenizers. On Hausa NER text, ~61% of entity words split into multiple subwords (vs ~21% for AfriBERTa), so per-token representations carry less whole-word meaning; (b) at 30.9M parameters the model has less capacity to reassemble meaning from fragments than a 126M model. Use a larger model if NER accuracy is critical.
- Hausa NER is notably low (0.656). Fragmentation on the MasakhaNER Hausa corpus is high (~1.52 subwords/word, vs ~1.23 on the tokenizer's MasakhaNEWS benchmark), suggesting an orthography/domain mismatch worth investigating for a future version.
- Nigerian Pidgin pretraining data is scarce. Clean, permissively-licensed Pidgin text is limited; the Pidgin slice was lightly upsampled. Treat Pidgin as supported but thinner than the other three.
- Small model. Best for sentence-level understanding and efficient deployment, not as a drop-in for the strongest available African-language encoders on hard tasks.
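The two fragmentation statistics quoted in the NER bullets (share of words split into multiple subwords, and subwords per word) can be measured along these lines. This is a sketch under assumptions: each word is tokenized in isolation, which can differ slightly from in-context tokenization with a byte-level BPE, and `multi_subword_rate`, `subwords_per_word`, and `toy_tokenize` are hypothetical names.

```python
from typing import Callable, List, Sequence


def multi_subword_rate(words: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Share of words that the tokenizer splits into more than one subword."""
    if not words:
        raise ValueError("words is empty")
    return sum(1 for w in words if len(tokenize(w)) > 1) / len(words)


def subwords_per_word(words: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subwords produced per word."""
    if not words:
        raise ValueError("words is empty")
    return sum(len(tokenize(w)) for w in words) / len(words)


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer: cut a word into pieces of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


entity_words = ["Kano", "Abuja", "a"]  # illustrative words, not benchmark data
print(multi_subword_rate(entity_words, toy_tokenize))
print(subwords_per_word(entity_words, toy_tokenize))
```

Running the same functions with two real tokenizers over the entity words of a NER split gives a like-for-like fragmentation comparison.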
What would improve a v2
Evidence-backed, in priority order: (1) more model capacity (~50–80M) — NER is where the parameter gap bit hardest; (2) a less-fragmenting tokenizer for token tasks (larger vocab or per-language merge budgets); (3) more pretraining data, especially Pidgin and Hausa. Longer pretraining is not a lever — eval loss already plateaued by ~epoch 11.
License
Apache-2.0. Training data is attribution-only (ODC-By / Apache-2.0 / CC-BY-4.0); please retain attribution to the upstream datasets.
Acknowledgements & citations
Built with FineWeb-2, WURA (Oladipo et al., EMNLP 2023), and the Nigerian Pidgin ASR corpus. Evaluated on MasakhaNEWS and MasakhaNER 2.0 (Adelani et al., Masakhane). AfriBERTa (Ogueji et al., 2021) and mBERT (Devlin et al., 2019) used as comparison baselines.