Zeineuski — Basque Dialect Identification

Fine-grained dialect identification (DID) system for Basque (Euskara). Given a text or speech sample, classifies it into one of six dialect categories: Western (Bizkaiera), Central (Gipuzkera), Navarrese, Navarrese-Labourdin, Souletin (Zuberera), or Standard Basque (Batua).

Source code: github.com/itzune/zeineuski

Architecture

Zeineuski uses a three-tier hierarchical classification architecture:

Tier 1: batua / dialectal (binary)
  └─ Tier 2: 5-class euskalkia (dialect classification)
       └─ Tier 3: 9 to 12-class azpieuskalkia (sub-dialect classification)

Classification taxonomy

The project follows Koldo Zuazo's dialect classification, which is the current linguistic consensus and the basis for Ahotsak.eus's municipality→dialect mapping.

Zuazo recognizes 6 euskalkiak (dialects):

| # | Euskalkia | Our label | Notes |
|---|---|---|---|
| 1 | Bizkaiera / Mendebalekoa | western | |
| 2 | Gipuzkera / Erdialdekoa | central | |
| 3 | Goi-nafarrera | navarrese | Upper Navarrese |
| 4 | Ekialdeko nafarrera / Erronkariera | (merged into navarrese) | Extinct ~1990s; tiny data |
| 5 | Zuberera | souletin | |
| 6 | Nafar-lapurtera | nav-lab | |
| + | Euskara batua | batua | Standard unified Basque |

Why 5 euskalkis + batua instead of 6 + batua?

Ekialdeko nafarrera (Salazarese/Roncalese) is linguistically a distinct dialect, but it has been functionally extinct since the 1990s (last native speaker died in 1991). Ahotsak.eus has only ~65 passages across 7 towns in the Zaraitzu and Erronkari valleys. The Klasikoak.armiarma.eus classical literature corpus — which provides most of our Tier-2 training data — maps these texts to navarrese since the dialect distinction is not present in pre-20th-century literary sources.
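The merge described above amounts to a small lookup table. The sketch below is illustrative only: the dictionary keys and label strings are copied from the taxonomy table, while `EUSKALKI_TO_LABEL` and `to_fasttext_line` are hypothetical names that are not taken from the zeineuski codebase. The `__label__` prefix is fastText's default label marker.

```python
# Hypothetical mapping from Zuazo's euskalkiak to the labels used in this card.
EUSKALKI_TO_LABEL = {
    "Bizkaiera / Mendebalekoa": "western",
    "Gipuzkera / Erdialdekoa": "central",
    "Goi-nafarrera": "navarrese",
    # Merged into navarrese because of the tiny amount of data.
    "Ekialdeko nafarrera / Erronkariera": "navarrese",
    "Zuberera": "souletin",
    "Nafar-lapurtera": "nav-lab",
    "Euskara batua": "batua",
}


def to_fasttext_line(euskalki: str, text: str) -> str:
    """Format one example in fastText's supervised input format."""
    label = EUSKALKI_TO_LABEL[euskalki]
    # fastText reads one example per line, so collapse newlines and extra spaces.
    return f"__label__{label} {' '.join(text.split())}"


# Seven source categories collapse to six labels (5 euskalkis + batua).
print(sorted(set(EUSKALKI_TO_LABEL.values())))
```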

For Tier 3 (azpieuskalkia), we follow the Zuazo azpieuskalki taxonomy as implemented on Ahotsak.eus. The official Ahotsak municipality→azpieuskalki mapping provides the ground truth labels for sub-dialect classification.

Models

Euskalki (Dialect) Classification — 5 euskalkis + batua (6-class)

Hierarchical 2-step classifier (binary batua/dialectal → 5-class euskalkiak):

| Variant | Filename | Size | XNLI (3-class) | Test (4-class) | Batua F1 |
|---|---|---|---|---|---|
| final | hier_binary_final.bin + hier_dialect_final.bin | 1.5GB | 92.42% | 95.18% | 0.962 |
| quantized | hier_*_quantized.bin | 417MB | 92.38% | 95.16% | 0.961 |
| compact | hier_*_compact.bin | 189MB | 91.78% | 94.71% | 0.957 |
| tiny | hier_*_tiny.bin | 112MB | 91.90% | 94.88% | 0.961 |
| web | hier_binary_web.bin + hier_dialect_web.bin | 32MB | 91.06% | 94.33% | 0.952 |

Per-class F1 (final): Western 0.953, Central 0.933, Nav-Lab 0.949, Batua 0.962.
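The two-step cascade can be sketched as follows. This is a minimal illustration, not the project's own inference code: it assumes that the binary model emits `__label__batua` for Standard Basque (the actual label strings may differ), and it combines the two steps by multiplying their probabilities, which is one reasonable choice among several. `top1` and `predict_euskalki` are hypothetical helper names; any object with a fastText-style `predict(text, k=...)` method works.

```python
from typing import Tuple


def top1(model, text: str) -> Tuple[str, float]:
    """Return the top label (prefix stripped) and its probability."""
    labels, probs = model.predict(text, k=1)
    return labels[0].replace("__label__", ""), float(probs[0])


def predict_euskalki(binary_model, dialect_model, text: str,
                     batua_label: str = "batua") -> Tuple[str, float]:
    """Step 1: batua vs dialectal. Step 2: dialect model, only if dialectal."""
    label, p_step1 = top1(binary_model, text)
    if label == batua_label:  # assumed label name, see lead-in
        return label, p_step1
    dialect, p_step2 = top1(dialect_model, text)
    # Chain rule: P(dialect) = P(dialectal) * P(dialect | dialectal)
    return dialect, p_step1 * p_step2
```

With the files listed above, the two models would be loaded with `fasttext.load_model("hier_binary_final.bin")` and `fasttext.load_model("hier_dialect_final.bin")` and passed in as `binary_model` and `dialect_model`.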

Azpieuskalki (Sub-Dialect) Classification — 9 to 12-class

Fine-grained sub-dialect classifier trained on Ahotsak.eus oral history transcriptions.

| Variant | Filename | Classes | Accuracy |
|---|---|---|---|
| 12-class (all) | azpieuskalki.bin (233MB) | 12 | 82.08% |
| 9-class (min_samples=600) | azpieuskalki.bin (233MB) | 9 | 83.55% |
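The 9-class variant is labelled `min_samples=600`. The source does not define this parameter; one plausible reading, assumed here, is that sub-dialects with fewer than 600 training examples are dropped before training. Under that assumption, the filtering step could look like the sketch below, where `filter_rare_classes` is a hypothetical helper operating on lines in fastText's `__label__X text` format.

```python
from collections import Counter
from typing import Iterable, List


def filter_rare_classes(lines: Iterable[str], min_samples: int) -> List[str]:
    """Keep only examples whose label occurs at least min_samples times."""
    examples = [ln for ln in lines if ln.strip()]
    # The label is the first whitespace-separated token of each line.
    counts = Counter(ln.split(maxsplit=1)[0] for ln in examples)
    return [ln for ln in examples if counts[ln.split(maxsplit=1)[0]] >= min_samples]
```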

Compression variants:

| Variant | Filename | Accuracy | Size | vs original |
|---|---|---|---|---|
| original | azpieuskalki.bin | 83.59% | 233MB | baseline |
| quantized | azpieuskalki_q.bin | 83.28% | 31MB | -0.31pp, 7.5× smaller |
| bucket=50K | azpieuskalki_b50000.bin | 83.22% | 119MB | -0.37pp, 2× smaller |
| bucket=50K Q | azpieuskalki_b50000_q.bin | 82.80% | 17MB | -0.79pp, 13.7× smaller |
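The "vs original" figures follow directly from the accuracy and size columns. As a quick check, the snippet below recomputes them from the numbers in the table; the variable names are illustrative.

```python
# Accuracy (%) and size (MB) copied from the compression table.
baseline_acc, baseline_mb = 83.59, 233
variants = {
    "quantized": (83.28, 31),
    "bucket=50K": (83.22, 119),
    "bucket=50K Q": (82.80, 17),
}

# (accuracy change in percentage points, size reduction factor)
summary = {
    name: (round(acc - baseline_acc, 2), round(baseline_mb / mb, 1))
    for name, (acc, mb) in variants.items()
}

for name, (delta_pp, ratio) in summary.items():
    print(f"{name}: {delta_pp:+.2f}pp, {ratio}x smaller")
```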

Usage

import fasttext

# Load a model
model = fasttext.load_model("azpieuskalki.bin")

# Predict
text = "Eta ehuna etxian eitten zan?"
labels, probs = model.predict(text, k=3)
print(labels[0].replace("__label__", ""), probs[0])

Or use the zeineuski CLI from the source repo:

uv run zeineuski predict --text "Gaur goizean goiz jaiki naiz"
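When calling the Python API on free-form input, note that fastText's `predict` raises a `ValueError` if the string contains a newline. A small wrapper, sketched here under the hypothetical name `predict_top_k`, can normalize whitespace and return clean `(label, probability)` pairs. It accepts any object with a fastText-style `predict(text, k=...)` method.

```python
from typing import List, Tuple


def predict_top_k(model, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely labels with the __label__ prefix removed."""
    # Collapse newlines and repeated spaces; predict() handles one line at a time.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=k)
    return [(lab.replace("__label__", ""), float(p)) for lab, p in zip(labels, probs)]
```

For example, `predict_top_k(fasttext.load_model("azpieuskalki.bin"), text)` would return up to three sub-dialect candidates for `text`.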

Web Demo

Try it in your browser — no server, no install:

itzune.eus/euskalkid (source)

34MB of fastText models running via WebAssembly. Works offline after first load.

Training

Optimal hyperparameters (discovered via pi-autoresearch, 37 experiments over 3 sessions):

Euskalki binary + 5-class:

# Step 1 — batua vs dialectal
fasttext supervised -input train_binary.txt -output hier_binary \
  -lr 3.0 -epoch 50 -dim 100 -minn 3 -maxn 6 -wordNgrams 2

# Step 2 — 5 euskalkis (no batua)
fasttext supervised -input train_dialectal_5class.txt -output hier_dialect \
  -lr 0.2 -epoch 150 -dim 100 -minn 3 -maxn 6 -wordNgrams 2

Azpieuskalki 9-class:

fasttext supervised -input train_azpieuskalki.txt -output azpieuskalki \
  -dim 200 -lr 0.2 -epoch 75 -wordNgrams 2 -minn 2 -maxn 6 -loss ns
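For readers who prefer the Python bindings, the same hyperparameters can be passed to `fasttext.train_supervised`. The sketch below copies the values from the commands above; the `CONFIGS` dictionary and the `train` helper are illustrative names, and the training files are assumed to exist in the working directory.

```python
# Keyword arguments mirror the CLI flags shown above.
CONFIGS = {
    "hier_binary": dict(input="train_binary.txt", lr=3.0, epoch=50,
                        dim=100, minn=3, maxn=6, wordNgrams=2),
    "hier_dialect": dict(input="train_dialectal_5class.txt", lr=0.2, epoch=150,
                         dim=100, minn=3, maxn=6, wordNgrams=2),
    "azpieuskalki": dict(input="train_azpieuskalki.txt", dim=200, lr=0.2, epoch=75,
                         wordNgrams=2, minn=2, maxn=6, loss="ns"),
}


def train(name: str) -> None:
    """Train one model with fixed hyperparameters (no autotune) and save it."""
    import fasttext  # imported lazily so CONFIGS can be inspected without it

    model = fasttext.train_supervised(**CONFIGS[name])
    model.save_model(f"{name}.bin")
```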

Key insight: avoid autotune, since aggressive LR decay overfits to dominant classes. Character n-grams (minn=2, maxn=6) capture dialect-specific Basque morphological patterns such as case endings and verb suffixes (+9.4pp improvement).
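To make the role of character n-grams concrete, the sketch below enumerates the subword units that a `minn=2`, `maxn=6` setting would produce for one word. It is a simplified illustration of fastText's scheme, which wraps each word in `<` and `>` boundary markers and then hashes every n-gram into a bucket; hashing and byte-level details are omitted. The function name `char_ngrams` is hypothetical, and the word "etxian" is taken from the usage example.

```python
from typing import List


def char_ngrams(word: str, minn: int = 2, maxn: int = 6) -> List[str]:
    """List all character n-grams of length minn..maxn, with boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


ngrams = char_ngrams("etxian")
# Suffix n-grams such as "an>" and "ian>" carry the word ending,
# which is where case markers and verb suffixes appear.
print([g for g in ngrams if g.endswith(">")])
```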

License

MIT
