Zeineuski — Basque Dialect Identification
Fine-grained dialect identification (DID) system for Basque (Euskara). Given a text or speech sample, classifies it into one of six dialect categories: Western (Bizkaiera), Central (Gipuzkera), Navarrese, Navarrese-Labourdin, Souletin (Zuberera), or Standard Basque (Batua).
Source code: github.com/itzune/zeineuski
Architecture
Zeineuski uses a three-tier hierarchical classification architecture:
Tier 1: batua / dialectal (binary)
└─ Tier 2: 5-class euskalkia (dialect classification)
└─ Tier 3: 9 to 12-class azpieuskalkia (sub-dialect classification)
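The first two tiers of this cascade can be sketched in Python as follows. This is a minimal sketch, not the project's implementation: the Tier 1 label name `batua` is an assumption, the function names are hypothetical, and any object exposing fastText's `predict(text, k)` interface can stand in for the models.

```python
def strip_label(label):
    """Remove fastText's default "__label__" prefix."""
    return label.replace("__label__", "")


def classify_hierarchical(text, binary_model, dialect_model):
    """Run Tier 1, then Tier 2 if needed; return (label, probability).

    Both models are expected to behave like fastText models, i.e.
    predict(text, k) returns (labels, probabilities).
    """
    # fastText's predict() rejects newlines, so flatten the input first.
    line = " ".join(text.split())

    labels, probs = binary_model.predict(line, k=1)
    # Assumption: the binary model emits "batua" for Standard Basque.
    if strip_label(labels[0]) == "batua":
        return "batua", float(probs[0])

    # Otherwise the sample is dialectal: ask the 5-class model.
    labels, probs = dialect_model.predict(line, k=1)
    return strip_label(labels[0]), float(probs[0])
```

With real models, the two arguments would be loaded with `fasttext.load_model`, for example from `hier_binary_final.bin` and `hier_dialect_final.bin` as listed in the Models section.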
Classification taxonomy
The project follows Koldo Zuazo's dialect classification, which is the current linguistic consensus and the basis for Ahotsak.eus's municipality→dialect mapping.
Zuazo recognizes 6 euskalkiak (dialects):
| # | Euskalkia | Our label | Notes |
|---|---|---|---|
| 1 | Bizkaiera / Mendebalekoa | western | |
| 2 | Gipuzkera / Erdialdekoa | central | |
| 3 | Goi-nafarrera | navarrese | Upper Navarrese |
| 4 | Ekialdeko nafarrera / Erronkariera | (merged into navarrese) | Extinct ~1990s; tiny data |
| 5 | Zuberera | souletin | |
| 6 | Nafar-lapurtera | nav-lab | |
| + | Euskara batua | batua | Standard unified Basque |
Why 5 euskalkis + batua instead of 6 + batua?
Ekialdeko nafarrera (Salazarese/Roncalese) is linguistically a distinct dialect, but
it has been functionally extinct since the 1990s (last native speaker died in 1991).
Ahotsak.eus has only ~65 passages across 7 towns in the Zaraitzu and Erronkari valleys.
The Klasikoak.armiarma.eus classical literature corpus — which provides most of our
Tier-2 training data — maps these texts to navarrese since the dialect distinction
is not present in pre-20th-century literary sources.
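
The merge described above amounts to a many-to-one mapping from Zuazo's euskalkiak to training labels. A minimal sketch, where the dictionary and helper names are hypothetical and the keys are simply the names used in the taxonomy table:

```python
# Zuazo euskalkia -> training label, following the taxonomy table.
EUSKALKI_TO_LABEL = {
    "Bizkaiera / Mendebalekoa": "western",
    "Gipuzkera / Erdialdekoa": "central",
    "Goi-nafarrera": "navarrese",
    "Ekialdeko nafarrera / Erronkariera": "navarrese",  # merged, see text
    "Zuberera": "souletin",
    "Nafar-lapurtera": "nav-lab",
    "Euskara batua": "batua",
}


def dialect_labels():
    """Distinct labels excluding batua, i.e. the Tier 2 classes."""
    return sorted({v for v in EUSKALKI_TO_LABEL.values() if v != "batua"})
```

Calling `dialect_labels()` yields five labels, which together with `batua` gives the six categories the classifier outputs.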
For Tier 3 (azpieuskalkia), we follow the Zuazo azpieuskalki taxonomy as implemented on Ahotsak.eus. The official Ahotsak municipality→azpieuskalki mapping provides the ground-truth labels for sub-dialect classification.
Models
Euskalki (Dialect) Classification — 5 euskalkis + batua (6-class)
Hierarchical 2-step classifier (binary batua/dialectal → 5-class euskalkiak):
| Variant | Filename | Size | XNLI (3-class) | Test (4-class) | Batua F1 |
|---|---|---|---|---|---|
| final | hier_binary_final.bin + hier_dialect_final.bin | 1.5GB | 92.42% | 95.18% | 0.962 |
| quantized | hier_*_quantized.bin | 417MB | 92.38% | 95.16% | 0.961 |
| compact | hier_*_compact.bin | 189MB | 91.78% | 94.71% | 0.957 |
| tiny | hier_*_tiny.bin | 112MB | 91.90% | 94.88% | 0.961 |
| web | hier_binary_web.bin + hier_dialect_web.bin | 32MB | 91.06% | 94.33% | 0.952 |
Per-class F1 (final): Western 0.953, Central 0.933, Nav-Lab 0.949, Batua 0.962.
Azpieuskalki (Sub-Dialect) Classification — 9 to 12-class
Fine-grained sub-dialect classifier trained on Ahotsak.eus oral history transcriptions.
| Variant | Filename | Classes | Accuracy |
|---|---|---|---|
| 12-class (all) | azpieuskalki.bin (233MB) | 12 | 82.08% |
| 9-class (min_samples=600) | azpieuskalki.bin (233MB) | 9 | 83.55% |
Compression variants:
| Variant | Filename | Accuracy | Size | vs original |
|---|---|---|---|---|
| original | azpieuskalki.bin | 83.59% | 233MB | baseline |
| quantized | azpieuskalki_q.bin | 83.28% | 31MB | -0.31pp, 7.5× smaller |
| bucket=50K | azpieuskalki_b50000.bin | 83.22% | 119MB | -0.37pp, 2× smaller |
| bucket=50K Q | azpieuskalki_b50000_q.bin | 82.80% | 17MB | -0.79pp, 13.7× smaller |
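
The "vs original" column follows directly from the accuracy and size figures in the table above. A short sketch of that arithmetic, using only the published numbers (the names `ORIGINAL`, `VARIANTS` and `compare` are illustrative):

```python
# (filename, accuracy in %, size in MB), copied from the table above.
ORIGINAL = ("azpieuskalki.bin", 83.59, 233)
VARIANTS = [
    ("azpieuskalki_q.bin", 83.28, 31),
    ("azpieuskalki_b50000.bin", 83.22, 119),
    ("azpieuskalki_b50000_q.bin", 82.80, 17),
]


def compare(variant, baseline=ORIGINAL):
    """Return (accuracy change in percentage points, size reduction factor)."""
    _, acc, size = variant
    _, base_acc, base_size = baseline
    return round(acc - base_acc, 2), round(base_size / size, 1)


for variant in VARIANTS:
    delta, ratio = compare(variant)
    print(f"{variant[0]}: {delta:+.2f}pp, {ratio}x smaller")
```

Read this way, quantization alone costs about a third of a percentage point while cutting the file to roughly an eighth of its size.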
Usage
import fasttext
# Load a model
model = fasttext.load_model("azpieuskalki.bin")
# Predict
text = "Eta ehuna etxian eitten zan?"
labels, probs = model.predict(text, k=3)
print(labels[0].replace("__label__", ""), probs[0])
Or use the zeineuski CLI from the source repo:
uv run zeineuski predict --text "Gaur goizean goiz jaiki naiz"
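
When calling `predict` with `k > 1`, it can be convenient to pair each label with its probability. A small sketch; the helper name `top_k` is hypothetical, and the only assumption about the model is fastText's default `__label__` prefix:

```python
def top_k(labels, probs, prefix="__label__"):
    """Pair fastText labels with probabilities, dropping the label prefix."""
    cleaned = [
        label[len(prefix):] if label.startswith(prefix) else label
        for label in labels
    ]
    return [(label, float(p)) for label, p in zip(cleaned, probs)]
```

With a loaded model this would be used as `top_k(*model.predict(text, k=3))`. Note that fastText's `predict` expects a single line, so newlines should be removed from `text` beforehand.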
Web Demo
Try it in your browser — no server, no install:
34MB of fastText models running via WebAssembly. Works offline after first load.
Training
Optimal hyperparameters (discovered via pi-autoresearch, 37 experiments over 3 sessions):
Euskalki binary + 5-class:
# Step 1 — batua vs dialectal
fasttext supervised -input train_binary.txt -output hier_binary \
-lr 3.0 -epoch 50 -dim 100 -minn 3 -maxn 6 -wordNgrams 2
# Step 2 — 5 euskalkis (no batua)
fasttext supervised -input train_dialectal_5class.txt -output hier_dialect \
-lr 0.2 -epoch 150 -dim 100 -minn 3 -maxn 6 -wordNgrams 2
Azpieuskalki 9-class:
fasttext supervised -input train_azpieuskalki.txt -output azpieuskalki \
-dim 200 -lr 0.2 -epoch 75 -wordNgrams 2 -minn 2 -maxn 6 -loss ns
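
The same hyperparameters can also be passed to the fastText Python bindings. A sketch, assuming the `fasttext` package is installed and that the training file uses the standard `__label__<class> <text>` line format. The function name is hypothetical and this has not been checked against the project's own training scripts.

```python
# Hyperparameters copied from the azpieuskalki command above.
AZPIEUSKALKI_PARAMS = {
    "dim": 200,
    "lr": 0.2,
    "epoch": 75,
    "wordNgrams": 2,
    "minn": 2,
    "maxn": 6,
    "loss": "ns",
}


def train_azpieuskalki(train_path="train_azpieuskalki.txt",
                       output_path="azpieuskalki.bin"):
    """Train and save the sub-dialect classifier with the settings above."""
    import fasttext  # imported lazily so the constants work without it

    model = fasttext.train_supervised(input=train_path, **AZPIEUSKALKI_PARAMS)
    model.save_model(output_path)
    return model
```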
Key insight: do not use autotune, because its aggressive LR decay overfits to dominant classes. Character n-grams (minn=2, maxn=6) capture dialect-specific Basque morphological patterns such as case endings and verb suffixes (+9.4pp improvement).
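
To make the character n-gram point concrete, the sketch below enumerates the subwords of a single word in the way fastText does in outline: the word is wrapped in `<` and `>` boundary markers and every substring of length `minn` to `maxn` is taken. The function name is hypothetical, and fastText's hashing of n-grams into buckets is not shown.

```python
def char_ngrams(word, minn=2, maxn=6):
    """Character n-grams of "<word>" with lengths minn..maxn."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]
```

For example, `char_ngrams("etxian", 2, 3)` includes `an>`, an n-gram anchored at the end of the word, which is where case endings and verb suffixes appear.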
License
MIT