Zeuneuski Audio — Basque Dialect Classifier from Speech

A 5-class Basque dialect classifier (Western, Central, Navarrese, Navarrese-Labourdin, Souletin) built on a frozen Whisper large-v3-eu encoder followed by an MLP classifier.

This is the speech counterpart of the zeuneuski text classifier.

Model variants

Variant	Macro F1	Trained on	Description
`whisper_dialect_merged`	0.5193	Full merged Ahotsak+Mintzoak (balanced 10K)	Baseline: mean_std_max pooling, 768-dim MLP
`whisper_dialect_aug`	0.5342	Full merged + Navarrese augmentation ×3	Best audio-only model: embedding-level augmentation
`whisper_dialect_fusion`	0.6175	Ahotsak subset (21% with transcriptions)	Audio+text fusion (Whisper + fastText logits). Limited to Ahotsak data.

Per-class F1 (best audio-only model: whisper_dialect_aug)

Dialect	F1
Western	0.70
Central	0.34
Navarrese	0.38
Navarrese-Labourdin	0.83
Souletin	0.42
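
Macro F1 is the unweighted mean of the per-class F1 scores, so the table above can be checked against the reported figure. The short sketch below recomputes it from the rounded values; the result agrees with the 0.5342 reported for whisper_dialect_aug up to rounding of the per-class scores.

```python
# Per-class F1 values copied from the table above (rounded to two decimals).
per_class_f1 = {
    "Western": 0.70,
    "Central": 0.34,
    "Navarrese": 0.38,
    "Navarrese-Labourdin": 0.83,
    "Souletin": 0.42,
}

# Macro F1: every class counts equally, regardless of how many test samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.534
```

Because each class carries equal weight, the weak Central and Navarrese scores pull the macro average well below the Western and Navarrese-Labourdin scores.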

How it works

Audio (16kHz mono WAV) → Whisper large-v3-eu encoder
Encoder hidden states → mean_std_max pooling → 3840-dim vector
3840-dim vector → MLP with two hidden layers (3840→768→384→5) → dialect probabilities
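
The pooling step can be sketched as follows. This is an illustration, not the repository's code: it assumes that mean_std_max means concatenating the per-dimension mean, standard deviation, and maximum over time, which is consistent with the 3840-dim vector above (3 × 1280, the hidden size of the Whisper large encoder). The function name `mean_std_max_pool` and the random stand-in input are hypothetical.

```python
import numpy as np


def mean_std_max_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Pool a (time, dim) matrix of encoder states into one (3 * dim,) vector.

    Assumption: "mean_std_max" concatenates mean, std, and max over the time axis.
    """
    mean = hidden_states.mean(axis=0)
    std = hidden_states.std(axis=0)
    peak = hidden_states.max(axis=0)
    return np.concatenate([mean, std, peak])


rng = np.random.default_rng(0)
# Stand-in for real encoder output: 1500 frames of 1280-dim hidden states.
fake_states = rng.standard_normal((1500, 1280)).astype(np.float32)

pooled = mean_std_max_pool(fake_states)
print(pooled.shape)  # (3840,)
```

Pooling over time gives a fixed-size vector whatever the clip length, which is what lets a plain MLP sit on top of the frozen encoder.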

Requirements

GPU with 6+ GB VRAM (runs on CPU too, ~8-10× slower)
transformers, torch, numpy, soundfile
Whisper model auto-downloaded from xezpeleta/whisper-large-v3-eu

Usage

from src.models.speech.whisper_did import load_speech_model, predict_speech

# Load model (downloads Whisper encoder automatically)
encoder, mlp, label_encoder, scaler, config = load_speech_model(
    model_dir="models/speech/whisper_dialect_aug"
)

# Predict
result = predict_speech("audio.wav", encoder, mlp, label_encoder, scaler, config)
print(result["dialect"], result["confidence"])
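
Since the pipeline expects 16kHz mono WAV input, it may help to check a file before calling predict_speech. The sketch below uses only Python's standard-library `wave` module; the helper name `is_16k_mono_wav` is hypothetical and not part of this repository. Note that `wave` reads PCM WAV files only, and whether predict_speech resamples other inputs itself is not stated here.

```python
import os
import tempfile
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


# Demo: write one second of 16-bit silence at 16 kHz mono, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(16000)  # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)

print(is_16k_mono_wav(demo_path))  # True
```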

Training data

The training data merges Ahotsak.eus (36K segments, 78h) and Mintzoak.eus (160K segments, 181h), for 258.9h of audio across the 5 classes. Splits are town-disjoint 80/10/10 train/val/test, meaning no town appears in more than one split. Balanced subsampling to 10K per class.
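
A town-disjoint split followed by balanced subsampling could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the project's actual data pipeline: it applies the 80/10/10 ratio to towns rather than to segments, and the function names, town names, and labels are all hypothetical.

```python
import random
from collections import defaultdict


def town_disjoint_split(segment_towns, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign each town to exactly one split; return {split: [segment indices]}."""
    towns = sorted(set(segment_towns))
    random.Random(seed).shuffle(towns)
    n_train = int(ratios[0] * len(towns))
    n_val = int(ratios[1] * len(towns))
    split_of = {}
    for i, town in enumerate(towns):
        if i < n_train:
            split_of[town] = "train"
        elif i < n_train + n_val:
            split_of[town] = "val"
        else:
            split_of[town] = "test"
    splits = defaultdict(list)
    for idx, town in enumerate(segment_towns):
        splits[split_of[town]].append(idx)
    return dict(splits)


def balanced_subsample(labels, per_class, seed=0):
    """Keep at most `per_class` randomly chosen indices for each label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keep = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        keep.extend(idxs[:per_class])
    return sorted(keep)


# Toy data: 20 hypothetical towns with 5 segments each, alternating labels.
segment_towns = [f"town_{i % 20}" for i in range(100)]
labels = ["Western" if i % 2 == 0 else "Central" for i in range(100)]

splits = town_disjoint_split(segment_towns)
train_labels = [labels[i] for i in splits["train"]]
kept = balanced_subsample(train_labels, per_class=10)
print({name: len(idxs) for name, idxs in splits.items()}, len(kept))
```

Splitting by town rather than by segment keeps speakers from the same locality out of both train and test, so test scores reflect generalization to unseen towns rather than memorized local traits.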
