Whisper-large-v3 Arabic Dialect Identification (15s)
Model Description
This model identifies 17 regional Arabic dialects from short speech recordings. It is built on OpenAI's Whisper-large-v3 encoder, fine-tuned on the ADI17 dataset, a large-scale Arabic dialect corpus sourced from YouTube broadcasts covering 17 Arab countries.
The system takes up to 15 seconds of raw audio, converts it into a 128-channel log-mel spectrogram (the input size the Whisper-large-v3 encoder expects), passes it through the fine-tuned Whisper encoder, and produces a probability distribution over 17 dialect classes using a custom attention pooling mechanism and a multi-layer classification head.
Supported Dialects
| Code | Country | Code | Country | Code | Country |
|---|---|---|---|---|---|
| ALG | Algeria | KUW | Kuwait | QAT | Qatar |
| EGY | Egypt | LEB | Lebanon | SUD | Sudan |
| IRA | Iraq | LIB | Libya | SYR | Syria |
| JOR | Jordan | MAU | Mauritania | UAE | UAE |
| KSA | Saudi Arabia | MOR | Morocco | YEM | Yemen |
| OMA | Oman | PAL | Palestine | | |
Architecture
Input Audio (up to 15s)
↓
WhisperFeatureExtractor → 128-channel log-mel spectrogram (128 × 3000)
↓
Whisper-large-v3 Encoder (32 Transformer layers)
├── Layers 1–24: Frozen (low-level acoustic features preserved)
└── Layers 25–32: Fine-tuned (high-level dialect-specific features)
↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
↓
MLP Classifier:
    LayerNorm → Dropout(0.15)
    → Linear(1280 → 512) → GELU
    → Dropout(0.15)
    → Linear(512 → 256) → GELU
    → Dropout(0.075)
    → Linear(256 → 17) → dialect probabilities
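As a rough size check, the trainable head is tiny compared with the encoder. The sketch below counts the parameters of the pooling layer and the MLP classifier from the layer sizes in the diagram. It assumes the attention pooling is a single bias-free linear projection (as in the usage code in this card) and that every `Linear` and `LayerNorm` carries a bias term; the helper names are illustrative only.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def layernorm_params(dim):
    """LayerNorm has one scale and one shift value per feature."""
    return 2 * dim


D_MODEL = 1280      # Whisper-large-v3 encoder width
NUM_CLASSES = 17    # ADI17 dialect classes

# Attention pooling: one score per time frame, no bias (assumption, see above).
pool_params = linear_params(D_MODEL, 1, bias=False)

# MLP classifier: LayerNorm, then 1280 -> 512 -> 256 -> 17.
head_params = (
    layernorm_params(D_MODEL)
    + linear_params(D_MODEL, 512)
    + linear_params(512, 256)
    + linear_params(256, NUM_CLASSES)
)

total_params = pool_params + head_params
print(f"pool={pool_params:,} head={head_params:,} total={total_params:,}")
```

Under these assumptions the pooling layer and classifier together add just under 0.8 million parameters on top of the encoder.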
Training Details
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Dataset | ADI17 (17 Arabic dialect classes) |
| Audio window | 15 seconds, non-overlapping |
| Frozen layers | First 24 of 32 encoder layers |
| Optimizer | AdamW (lr=2e-5, weight decay=1e-2) |
| LR schedule | Cosine with 10% linear warmup |
| Batch size | 32 × 4 gradient accumulation steps = 128 effective |
| Precision | AMP float16 |
| Augmentation | SpecAugment (2 frequency + 2 time masks) |
| Loss | CrossEntropy + label smoothing (0.1) + inverse-frequency class weights |
| Early stopping | Patience = 4 epochs |
| Hardware | NVIDIA A100 |
| Training epochs | 30 |
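Two of the settings in the table can be illustrated in plain Python. The sketch below is not the training code: `lr_at_step` follows the usual cosine-with-linear-warmup formula with the base learning rate and 10% warmup from the table, and `inverse_frequency_weights` uses one common normalization (total / (num_classes × count)), which is an assumption because the card does not state the exact formula. Both function names and the toy class counts are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def inverse_frequency_weights(class_counts):
    """Per-class loss weights proportional to 1 / count (assumed normalization)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * n) for label, n in class_counts.items()}


# Effective batch size from the table: 32 samples x 4 accumulation steps.
effective_batch = 32 * 4

# Toy counts for illustration only; these are not ADI17 statistics.
toy_weights = inverse_frequency_weights({"A": 100, "B": 300})

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With this normalization a class that is three times rarer receives three times the weight, so under-represented dialects contribute more to the loss per sample.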
Results
| Metric | Score |
|---|---|
| Test Accuracy | 92.58% |
| Macro F1 | 91.51% |
| Weighted F1 | 92.65% |
Per-Dialect F1 Scores
| Dialect | F1 | Dialect | F1 | Dialect | F1 |
|---|---|---|---|---|---|
| ALG | 0.9498 | KUW | 0.8597 | QAT | 0.9503 |
| EGY | 0.8892 | LEB | 0.9644 | SUD | 0.9184 |
| IRA | 0.9111 | LIB | 0.8575 | SYR | 0.8040 |
| JOR | 0.9574 | MAU | 0.8965 | UAE | 0.9244 |
| KSA | 0.9488 | MOR | 0.9641 | YEM | 0.8793 |
| OMA | 0.9303 | PAL | 0.9509 | | |
Best dialect: Lebanese (LEB) at 96.44% F1
Hardest dialect: Syrian (SYR) at 80.40% F1, frequently confused with neighboring Levantine dialects
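As a quick consistency check, the macro F1 reported above is the unweighted mean of the per-dialect F1 scores. The snippet below recomputes it from the table values and also recovers the best and hardest dialects; the variable names are illustrative.

```python
# Per-dialect F1 scores copied from the table above.
per_dialect_f1 = {
    "ALG": 0.9498, "EGY": 0.8892, "IRA": 0.9111, "JOR": 0.9574,
    "KSA": 0.9488, "KUW": 0.8597, "LEB": 0.9644, "LIB": 0.8575,
    "MAU": 0.8965, "MOR": 0.9641, "OMA": 0.9303, "PAL": 0.9509,
    "QAT": 0.9503, "SUD": 0.9184, "SYR": 0.8040, "UAE": 0.9244,
    "YEM": 0.8793,
}

# Macro F1 gives every class equal weight regardless of its sample count.
macro_f1 = sum(per_dialect_f1.values()) / len(per_dialect_f1)
best = max(per_dialect_f1, key=per_dialect_f1.get)
hardest = min(per_dialect_f1, key=per_dialect_f1.get)

print(f"Macro F1: {macro_f1:.4f}  best: {best}  hardest: {hardest}")
# prints: Macro F1: 0.9151  best: LEB  hardest: SYR
```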
Data Preprocessing Pipeline
Raw .wav audio files were preprocessed as follows:
- Converted to mono and resampled to 16 kHz
- Segmented into non-overlapping 15-second windows (zero-padded if shorter)
- Maximum 50 windows per recording to prevent long files from dominating
- Each window converted to a 128-channel log-mel spectrogram using WhisperFeatureExtractor
- Features saved as float16 in compressed HDF5 files for fast training I/O
The training set was stratified and capped at 12,000 samples per class, yielding approximately 180,000 training windows in total.
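The windowing step described above can be sketched with NumPy. This is a minimal illustration, not the original preprocessing script: `segment_audio` is a hypothetical helper, and it assumes that a trailing partial window is kept and zero-padded and that the cap keeps the first 50 windows (the card does not say which windows are retained).

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, after resampling
WINDOW_SECONDS = 15    # window length used by this model
MAX_WINDOWS = 50       # per-recording cap


def segment_audio(audio, sample_rate=SAMPLE_RATE,
                  window_seconds=WINDOW_SECONDS, max_windows=MAX_WINDOWS):
    """Split mono audio into non-overlapping windows, zero-padding the last one."""
    window_len = sample_rate * window_seconds
    audio = np.asarray(audio, dtype=np.float32)
    if audio.size == 0:
        return np.zeros((0, window_len), dtype=np.float32)
    n_windows = int(np.ceil(audio.size / window_len))
    # Copy into a zero buffer so the final window is padded to full length.
    padded = np.zeros(n_windows * window_len, dtype=np.float32)
    padded[:audio.size] = audio
    # Assumption: keep the first `max_windows` windows of long recordings.
    return padded.reshape(n_windows, window_len)[:max_windows]


# A 40-second recording yields three windows; the last holds 10 s of audio.
windows = segment_audio(np.ones(40 * SAMPLE_RATE, dtype=np.float32))
print(windows.shape)
```

Each row of the result can then be passed to `WhisperFeatureExtractor` with `sampling_rate=16000`, as in the usage code below.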
How to Use
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download
# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-15s")
# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
# Define architecture (must match training)
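class AttentionPool(nn.Module):
    """Learned weighted average over the encoder time frames."""

    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)

    def forward(self, x):
        # x: (batch, frames, d_model); softmax over the frame axis
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)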
class WhisperDialectClassifier(nn.Module):
    """Whisper encoder followed by attention pooling and an MLP head."""

    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        # Only the encoder is used; the Whisper decoder is never called
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))
# Load model
device = "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()
# Inference
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 15*16000 - len(audio))))[:15*16000]
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    logits = model(inputs["input_features"].to(device))
    probs = torch.softmax(logits, dim=-1)[0]
# Class index to dialect code
ID2LABEL = {0: "ALG", 1: "EGY", 2: "IRA", 3: "JOR", 4: "KSA", 5: "KUW", 6: "LEB", 7: "LIB",
            8: "MAU", 9: "MOR", 10: "OMA", 11: "PAL", 12: "QAT", 13: "SUD", 14: "SYR", 15: "UAE", 16: "YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
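Because some dialects are easily confused with their neighbors, it can help to look at the few most probable classes rather than only the top one. The helper below is a small sketch in plain Python; `top_k_dialects` is a hypothetical name, and the probabilities are made-up values for illustration, not model output.

```python
ID2LABEL = {0: "ALG", 1: "EGY", 2: "IRA", 3: "JOR", 4: "KSA", 5: "KUW",
            6: "LEB", 7: "LIB", 8: "MAU", 9: "MOR", 10: "OMA", 11: "PAL",
            12: "QAT", 13: "SUD", 14: "SYR", 15: "UAE", 16: "YEM"}


def top_k_dialects(probs, id2label=ID2LABEL, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], float(p)) for i, p in ranked[:k]]


# Made-up probabilities for illustration only.
example_probs = [0.01] * 17
example_probs[6] = 0.60    # LEB
example_probs[14] = 0.20   # SYR
example_probs[11] = 0.06   # PAL

top3 = top_k_dialects(example_probs, k=3)
for label, p in top3:
    print(f"{label}: {p * 100:.1f}%")
```

With the tensor from the inference code above, the call would be `top_k_dialects(probs.tolist())`.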
Limitations
- Performance is lower on Syrian (SYR) dialect due to its acoustic similarity to other Levantine dialects (Lebanese, Palestinian, Jordanian)
- Model was trained on broadcast/YouTube speech; performance may degrade on conversational, noisy, or phone-quality audio
- Audio shorter than 3 seconds is not supported
- Only the encoder of Whisper-large-v3 is used; the decoder (transcription) is not activated
Citation
If you use this model in your research, please cite:
@misc{mesoj8_arabic_dialect_2026,
  author = {Meso},
  title = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 17 classes)},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-15s}
}