Zeuneuski Audio: Basque Dialect Classifier from Speech
5-class Basque dialect classifier (Western, Central, Navarrese, Navarrese-Labourdin, Souletin) using a frozen Whisper large-v3-eu encoder + MLP classifier.
This is the speech counterpart of the zeuneuski text classifier.
Model variants
| Variant | Macro F1 | Trained on | Description |
|---|---|---|---|
| whisper_dialect_merged | 0.5193 | Full merged Ahotsak+Mintzoak (balanced 10K) | Baseline: mean_std_max pooling, 768-dim MLP |
| whisper_dialect_aug | 0.5342 | Full merged + navarrese augmentation ×3 | Best on the full merged data: embedding-level augmentation |
| whisper_dialect_fusion | 0.6175 | Ahotsak subset (21% with transcriptions) | Audio+text fusion (Whisper + fastText logits). Limited to Ahotsak data, so its score is not directly comparable to the other two. |
Per-class F1 (best model: whisper_dialect_aug)
| Dialect | F1 |
|---|---|
| Western | 0.70 |
| Central | 0.34 |
| Navarrese | 0.38 |
| Navarrese-Labourdin | 0.83 |
| Souletin | 0.42 |
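
As a quick sanity check, the macro F1 reported for whisper_dialect_aug should be the unweighted mean of the per-class scores above. The sketch below recomputes it from the rounded table values, so it can only agree with the reported 0.5342 to about two decimal places.

```python
from statistics import mean

# Per-class F1 scores for whisper_dialect_aug, as rounded in the table above.
per_class_f1 = {
    "Western": 0.70,
    "Central": 0.34,
    "Navarrese": 0.38,
    "Navarrese-Labourdin": 0.83,
    "Souletin": 0.42,
}

# Macro F1 is the unweighted mean over classes.
macro_f1 = mean(per_class_f1.values())
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.534
```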
How it works
- Audio (16 kHz mono WAV) → Whisper large-v3-eu encoder
- Encoder hidden states → mean_std_max pooling → 3840-dim vector
- 3840-dim vector → 2-layer MLP (768→384→5) → dialect probabilities
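
The pooling and classification steps can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: it uses random numbers in place of real encoder output and trained weights, and it assumes a 1500×1280 hidden-state shape (a 30 s window from a Whisper large encoder), mean/std/max concatenated in that order, ReLU activations, and a softmax output. The feature scaler returned by `load_speech_model` is left out.

```python
import numpy as np

def mean_std_max_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Pool a (frames, hidden_dim) array over time into a (3 * hidden_dim,) vector."""
    return np.concatenate([
        hidden_states.mean(axis=0),
        hidden_states.std(axis=0),
        hidden_states.max(axis=0),
    ])

def mlp_forward(x, weights, biases):
    """ReLU hidden layers followed by a softmax output layer (assumed activations)."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)
    logits = x @ weights[-1] + biases[-1]
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)

# Stand-in for encoder output: 1500 frames x 1280 dims (assumed shape).
hidden_states = rng.standard_normal((1500, 1280))
pooled = mean_std_max_pool(hidden_states)

# Randomly initialised weights with the layer sizes listed above: 3840 -> 768 -> 384 -> 5.
sizes = [3840, 768, 384, 5]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

probs = mlp_forward(pooled, weights, biases)
print(pooled.shape, probs.shape)  # (3840,) (5,)
```

The 3840 figure follows directly from the pooling: three statistics over a 1280-dim hidden state give 3 × 1280 = 3840 features.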
Requirements
- GPU with 6+ GB VRAM (runs on CPU too, ~8-10× slower)
- transformers, torch, numpy, soundfile
- Whisper model auto-downloaded from xezpeleta/whisper-large-v3-eu
Usage
from src.models.speech.whisper_did import load_speech_model, predict_speech
# Load model (downloads Whisper encoder automatically)
encoder, mlp, label_encoder, scaler, config = load_speech_model(
    model_dir="models/speech/whisper_dialect_aug"
)
# Predict
result = predict_speech("audio.wav", encoder, mlp, label_encoder, scaler, config)
print(result["dialect"], result["confidence"])
Training data
Merged Ahotsak.eus (36K segments, 78h) + Mintzoak.eus (160K segments, 181h): 258.9h of audio in total across 5 classes. Splits are town-disjoint 80/10/10 train/val/test (no town appears in more than one split), with balanced subsampling to 10K per class.
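
To illustrate what a town-disjoint split with balanced subsampling looks like, here is a small standard-library sketch. It is not the project's data pipeline: the `town` and `dialect` field names and both helper functions are hypothetical, the 80/10/10 ratio is applied to towns rather than segments, and subsampling is applied to the training split only. The source does not specify either of those last two choices.

```python
import random
from collections import defaultdict

def town_disjoint_split(segments, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole towns to train/val/test so no town spans two splits."""
    towns = sorted({s["town"] for s in segments})
    random.Random(seed).shuffle(towns)
    n_train = int(ratios[0] * len(towns))
    n_val = int(ratios[1] * len(towns))
    split_of = {}
    for i, town in enumerate(towns):
        if i < n_train:
            split_of[town] = "train"
        elif i < n_train + n_val:
            split_of[town] = "val"
        else:
            split_of[town] = "test"
    splits = defaultdict(list)
    for s in segments:
        splits[split_of[s["town"]]].append(s)
    return splits

def balanced_subsample(segments, per_class, seed=0):
    """Keep at most `per_class` randomly chosen segments for each dialect."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for s in segments:
        by_dialect[s["dialect"]].append(s)
    kept = []
    for dialect in sorted(by_dialect):
        items = by_dialect[dialect]
        rng.shuffle(items)
        kept.extend(items[:per_class])
    return kept

# Toy data: 20 made-up towns, 30 segments each, two dialect labels.
segments = [
    {"town": f"town_{t:02d}", "dialect": "Western" if t % 2 else "Central"}
    for t in range(20)
    for _ in range(30)
]
splits = town_disjoint_split(segments)
train = balanced_subsample(splits["train"], per_class=100)

train_towns = {s["town"] for s in splits["train"]}
test_towns = {s["town"] for s in splits["test"]}
print(len(train), train_towns.isdisjoint(test_towns))
```

Splitting by town rather than by segment keeps speakers from the same locality out of the evaluation sets, so test scores reflect generalisation to unseen towns.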