Whisper-large-v3 Arabic Dialect Identification (30s)
Model Description
This model identifies 17 regional Arabic dialects from speech recordings of up to 30 seconds. It is built on top of OpenAI's Whisper-large-v3 encoder, fine-tuned on the ADI17 dataset, a large-scale Arabic dialect corpus sourced from YouTube broadcasts covering 17 Arab countries.
The system takes up to 30 seconds of raw audio, converts it into a 128-channel log-mel spectrogram (the input size Whisper-large-v3 expects), passes it through the fine-tuned Whisper encoder, and produces a probability distribution over 17 dialect classes using a custom attention pooling mechanism and a multi-layer classification head.
Compared to the 15s version, this model uses longer audio windows during training, giving the encoder more speech context per sample and achieving higher accuracy (95.11% vs 92.58%).
Supported Dialects
| Code | Country | Code | Country | Code | Country |
|---|---|---|---|---|---|
| ALG | Algeria | KUW | Kuwait | QAT | Qatar |
| EGY | Egypt | LEB | Lebanon | SUD | Sudan |
| IRA | Iraq | LIB | Libya | SYR | Syria |
| JOR | Jordan | MAU | Mauritania | UAE | UAE |
| KSA | Saudi Arabia | MOR | Morocco | YEM | Yemen |
| OMA | Oman | PAL | Palestine | | |
Architecture
Input Audio (up to 30s)
    ↓
WhisperFeatureExtractor → 128-channel log-mel spectrogram (128 × 3000)
    ↓
Whisper-large-v3 Encoder (32 Transformer layers)
    ├── Layers 1-24: Frozen (low-level acoustic features preserved)
    └── Layers 25-32: Fine-tuned (high-level dialect-specific features)
    ↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
    ↓
MLP Classifier Head:
    LayerNorm → Dropout(0.15)
    → Linear(1280 → 512) → GELU
    → Dropout(0.15)
    → Linear(512 → 256) → GELU
    → Dropout(0.075)
    → Linear(256 → 17) → dialect logits (softmax gives the probabilities)
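The frame counts and layer sizes in the diagram follow from simple arithmetic. The sketch below reproduces them in plain Python. The hop length of 160 samples and the stride-2 convolution in front of the encoder are standard Whisper settings, not values stated in this card, and the parameter count covers only the pooling layer and the classifier head, not the encoder.

```python
SAMPLE_RATE = 16_000      # Hz, as used in preprocessing
WINDOW_SECONDS = 30
HOP_LENGTH = 160          # assumption: Whisper's default STFT hop (10 ms)
ENCODER_STRIDE = 2        # assumption: Whisper's second conv layer halves the time axis

samples = SAMPLE_RATE * WINDOW_SECONDS        # raw samples per window
mel_frames = samples // HOP_LENGTH            # spectrogram frames fed to the encoder
encoder_frames = mel_frames // ENCODER_STRIDE # time steps seen by attention pooling


def linear_params(n_in, n_out, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


D_MODEL, NUM_CLASSES = 1280, 17
pool_params = linear_params(D_MODEL, 1, bias=False)  # attention scoring vector
head_params = (
    2 * D_MODEL                        # LayerNorm weight + bias
    + linear_params(D_MODEL, 512)
    + linear_params(512, 256)
    + linear_params(256, NUM_CLASSES)
)

print(samples, mel_frames, encoder_frames)  # 480000 3000 1500
print(pool_params, head_params)             # 1280 794129
```

The pooling layer and head together add under a million parameters, so almost all trainable weights sit in the eight unfrozen encoder layers.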
Training Details
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Dataset | ADI17 (17 Arabic dialect classes) |
| Audio window | 30 seconds, non-overlapping |
| Frozen layers | First 24 of 32 encoder layers |
| Optimizer | AdamW (lr=2e-5, weight decay=1e-2) |
| LR schedule | Cosine with 10% linear warmup |
| Batch size | 32 × 4 gradient accumulation = 128 effective |
| Precision | AMP float16 |
| Augmentation | SpecAugment (2 frequency + 2 time masks) |
| Loss | CrossEntropy + label smoothing (0.1) + inverse-frequency class weights |
| Max per class | 8,000 training samples |
| Early stopping | Patience = 4 epochs (stopped at epoch 15) |
| Hardware | NVIDIA A100 |
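Two rows of the table are easier to follow as code: the learning-rate schedule and the class weighting. The sketch below is a plain-Python illustration, not the training script. It assumes the cosine phase decays to zero and that the inverse-frequency weights are scaled to average 1; neither detail is stated in the table. The step count and class counts are hypothetical.

```python
import math


def lr_factor(step, total_steps, warmup_frac=0.10):
    """Multiplier for the base LR: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def inverse_frequency_weights(class_counts):
    """Weights proportional to 1/count, scaled so that they average to 1."""
    inverse = [1.0 / count for count in class_counts]
    scale = len(inverse) / sum(inverse)
    return [w * scale for w in inverse]


BASE_LR = 2e-5
TOTAL_STEPS = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, BASE_LR * lr_factor(step, TOTAL_STEPS))

# Hypothetical class counts: rarer classes receive larger weights.
print(inverse_frequency_weights([8000, 4000, 2000]))
```

In PyTorch, `lr_factor` could be passed to `torch.optim.lr_scheduler.LambdaLR`, and the weights to `nn.CrossEntropyLoss(weight=..., label_smoothing=0.1)`. With gradient accumulation, a step here means one optimizer update, that is, one effective batch of 128.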
Results
| Metric | Score |
|---|---|
| Test Accuracy | 95.11% |
| Macro F1 | 94.16% |
| Weighted F1 | 95.14% |
Per-Dialect F1 Scores
| Dialect | F1 | Dialect | F1 | Dialect | F1 |
|---|---|---|---|---|---|
| ALG | 95.73% | KUW | 90.94% | QAT | 97.32% |
| EGY | 93.58% | LEB | 97.81% | SUD | 93.72% |
| IRA | 94.12% | LIB | 87.16% | SYR | 88.78% |
| JOR | 96.62% | MAU | 90.96% | UAE | 95.33% |
| KSA | 96.22% | MOR | 97.54% | YEM | 92.16% |
| PAL | 96.91% | OMA | 95.82% | | |

Best dialect: Lebanese (LEB) at 97.81% F1.
Hardest dialect: Libyan (LIB) at 87.16% F1.
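Macro F1 is the unweighted mean of the per-dialect F1 scores, so the headline figure can be checked directly against the table. The snippet below copies the scores from the table and recomputes the mean, along with the best and hardest dialects.

```python
# Per-dialect F1 scores (%), copied from the table above.
f1_by_dialect = {
    "ALG": 95.73, "EGY": 93.58, "IRA": 94.12, "JOR": 96.62, "KSA": 96.22,
    "KUW": 90.94, "LEB": 97.81, "LIB": 87.16, "MAU": 90.96, "MOR": 97.54,
    "OMA": 95.82, "PAL": 96.91, "QAT": 97.32, "SUD": 93.72, "SYR": 88.78,
    "UAE": 95.33, "YEM": 92.16,
}

# Macro F1 gives every dialect equal weight, regardless of test-set size.
macro_f1 = sum(f1_by_dialect.values()) / len(f1_by_dialect)
best = max(f1_by_dialect, key=f1_by_dialect.get)
hardest = min(f1_by_dialect, key=f1_by_dialect.get)

print(f"Macro F1: {macro_f1:.2f}%")  # Macro F1: 94.16%
print(best, hardest)                 # LEB LIB
```

Weighted F1 (95.14%) cannot be recomputed this way because it also depends on the number of test samples per dialect, which the table does not list.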
Comparison with 15s Model
| Metric | 15s Model | 30s Model | Improvement (percentage points) |
|---|---|---|---|
| Test Accuracy | 92.58% | 95.11% | +2.53 |
| Macro F1 | 91.51% | 94.16% | +2.65 |
| Weighted F1 | 92.65% | 95.14% | +2.49 |
The 30s model consistently outperforms the 15s model across all dialects, as longer audio windows provide more phonetic, prosodic, and lexical context for dialect identification.
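The accuracy gain can also be read as a reduction in error rate, which is often easier to interpret when both models are already above 90%. The short calculation below uses only the two test accuracies from the comparison table.

```python
acc_15s, acc_30s = 92.58, 95.11  # test accuracy (%), from the comparison table

err_15s = 100.0 - acc_15s        # error rate of the 15s model
err_30s = 100.0 - acc_30s        # error rate of the 30s model
absolute_gain = acc_30s - acc_15s
relative_error_reduction = (err_15s - err_30s) / err_15s

print(f"{absolute_gain:.2f} points, {relative_error_reduction:.1%} fewer errors")
```

In other words, the 30s model makes roughly a third fewer mistakes on the test set than the 15s model.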
Data Preprocessing Pipeline
Raw .wav audio files were preprocessed as follows:
- Converted to mono and resampled to 16 kHz
- Segmented into non-overlapping 30-second windows (zero-padded if shorter)
- Maximum 10 windows per recording to prevent long files from dominating
- Each window converted to a 128-channel log-mel spectrogram using WhisperFeatureExtractor
- Features saved as float16 in compressed HDF5 files for fast training I/O
The training set was stratified and capped at 8,000 samples per class.
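The windowing steps can be sketched with NumPy as follows. This is an illustration of the description above, not the original preprocessing script: the function name is hypothetical, the input is assumed to be mono and already resampled to 16 kHz, and keeping the first 10 windows (rather than sampling 10 at random) is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE  # 480000 samples per window
MAX_WINDOWS = 10


def split_into_windows(audio, window=WINDOW_SAMPLES, max_windows=MAX_WINDOWS):
    """Cut audio into non-overlapping windows, zero-padding the last one."""
    audio = np.asarray(audio, dtype=np.float32)
    # Ceiling division gives the window count; always keep at least one window.
    n_windows = min(max_windows, max(1, -(-len(audio) // window)))
    audio = audio[: n_windows * window]  # drop anything beyond the window cap
    padded = np.zeros(n_windows * window, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n_windows, window)


# Synthetic 45-second clip: yields one full window and one half-padded window.
clip = np.random.default_rng(0).standard_normal(45 * SAMPLE_RATE)
windows = split_into_windows(clip)
print(windows.shape)  # (2, 480000)
```

Each row of the result can then be passed to `WhisperFeatureExtractor` with `sampling_rate=16000` to obtain the log-mel features.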
How to Use
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download

# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-30s")

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)

# Define architecture (must match training)
class AttentionPool(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)

    def forward(self, x):
        # x: (batch, frames, d_model); softmax runs over the time axis
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)

class WhisperDialectClassifier(nn.Module):
    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        # Only the encoder is run; the Whisper decoder is never called
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))

# Load model
device = "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()

# Inference
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
target = 30 * 16000  # 480000 samples
# Zero-pad short clips and truncate long ones to exactly 30 seconds
audio = np.pad(audio, (0, max(0, target - len(audio))))[:target]
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    logits = model(inputs["input_features"].to(device))
    probs = torch.softmax(logits, dim=-1)[0]

ID2LABEL = {0:"ALG",1:"EGY",2:"IRA",3:"JOR",4:"KSA",5:"KUW",6:"LEB",7:"LIB",
            8:"MAU",9:"MOR",10:"OMA",11:"PAL",12:"QAT",13:"SUD",14:"SYR",15:"UAE",16:"YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
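Because some dialects overlap acoustically, it can help to look at the top few candidates and not only the single best guess. The helper below is a small sketch with a hypothetical name (`top_k`) that ranks any sequence of 17 probabilities. It runs here on a synthetic distribution, not on real model output.

```python
ID2LABEL = {0: "ALG", 1: "EGY", 2: "IRA", 3: "JOR", 4: "KSA", 5: "KUW",
            6: "LEB", 7: "LIB", 8: "MAU", 9: "MOR", 10: "OMA", 11: "PAL",
            12: "QAT", 13: "SUD", 14: "SYR", 15: "UAE", 16: "YEM"}


def top_k(probabilities, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(ID2LABEL[index], float(p)) for index, p in ranked[:k]]


# Synthetic distribution for illustration only (not model output).
example = [0.01] * 17
example[1], example[7] = 0.60, 0.25
for label, p in top_k(example):
    print(f"{label}: {p * 100:.1f}%")
```

With the inference code above, the same helper would be called as `top_k(probs.tolist())`.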
Limitations
- Libyan (LIB) and Syrian (SYR) dialect performance is lower due to acoustic overlap with neighboring dialects
- The model was trained on broadcast and YouTube speech; performance may degrade on conversational, noisy, or phone-quality audio
- Code-switching (mixing Arabic with other languages) may reduce accuracy
- Audio shorter than 3 seconds is not supported
- Only the Whisper encoder is used; the decoder (transcription) is never run
Related Models
| Model | Window | Accuracy | Repo |
|---|---|---|---|
| Whisper-large-v3 ADI17 15s | 15s | 92.58% | mesoj8/whisper-large-v3-arabic-dialect-15s |
| Whisper-large-v3 ADI17 30s | 30s | 95.11% | This model |
Citation
@misc{mesoj8_arabic_dialect_30s_2026,
author = {Meso},
title = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 30s windows)},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-30s}
}