Whisper-large-v3 Arabic Dialect Identification (15s)

Model Description

This model identifies 17 regional Arabic dialects from short speech recordings. It is built on the encoder of OpenAI's Whisper-large-v3, fine-tuned on ADI17, a large-scale Arabic dialect corpus collected from YouTube broadcasts and covering 17 Arab countries.

The system takes up to 15 seconds of raw audio, converts it into a 128-channel log-mel spectrogram (the input format Whisper-large-v3 expects), passes it through the fine-tuned Whisper encoder, and produces a probability distribution over the 17 dialect classes using learned attention pooling followed by a multi-layer classification head.
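The frame counts in this pipeline follow from Whisper's standard front-end settings. The sketch below works through the arithmetic, assuming the stock WhisperFeatureExtractor defaults (16 kHz sampling, a hop of 160 samples, every input padded to 30 seconds) and the encoder's stride-2 convolution. The constant names are illustrative and are not taken from the model repository.

```python
SAMPLE_RATE = 16_000   # Hz, the rate Whisper expects
WINDOW_SECONDS = 15    # audio window used by this model
CHUNK_SECONDS = 30     # assumed: the feature extractor pads every input to 30 s
HOP_LENGTH = 160       # assumed: samples between successive spectrogram frames
ENCODER_STRIDE = 2     # assumed: the encoder's second convolution halves the time axis

window_samples = WINDOW_SECONDS * SAMPLE_RATE      # real audio samples per window
padded_samples = CHUNK_SECONDS * SAMPLE_RATE       # samples after padding to 30 s
mel_frames = padded_samples // HOP_LENGTH          # spectrogram time frames
encoder_frames = mel_frames // ENCODER_STRIDE      # frames seen by attention pooling
speech_frames = (window_samples // HOP_LENGTH) // ENCODER_STRIDE

print(f"samples per window : {window_samples}")
print(f"mel frames         : {mel_frames}")
print(f"encoder frames     : {encoder_frames}")
print(f"frames with audio  : {speech_frames}")
```

Under these assumptions, a full 15-second window fills only the first half of the encoder's time axis; the remaining frames come from padding, and the attention pooling layer has to learn to down-weight them.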

Supported Dialects

Code	Country	Code	Country	Code	Country
ALG	Algeria	KUW	Kuwait	QAT	Qatar
EGY	Egypt	LEB	Lebanon	SUD	Sudan
IRA	Iraq	LIB	Libya	SYR	Syria
JOR	Jordan	MAU	Mauritania	UAE	United Arab Emirates
KSA	Saudi Arabia	MOR	Morocco	YEM	Yemen
		OMA	Oman	PAL	Palestine

Architecture

Input Audio (up to 15s)
        ↓
WhisperFeatureExtractor → 128-channel log-mel spectrogram (128 × 3000)
        ↓
Whisper-large-v3 Encoder (32 Transformer layers)
  ├── Layers 1–24 : Frozen (low-level acoustic features preserved)
  └── Layers 25–32: Fine-tuned (high-level dialect-specific features)
        ↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
        ↓
MLP Classifier:
  LayerNorm → Dropout(0.15)
  → Linear(1280 → 512) → GELU
  → Dropout(0.15)
  → Linear(512 → 256) → GELU
  → Dropout(0.075)
  → Linear(256 → 17) → softmax → dialect probabilities
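To give a sense of how small the task-specific part is, the sketch below counts the trainable parameters of the pooling layer and the MLP head from the layer sizes in the diagram. It assumes the pooling layer is a single bias-free linear scoring layer (as in the usage code in this card) and that LayerNorm carries a scale and a shift per feature; the helper name `linear_params` is hypothetical.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

D_MODEL = 1280       # width of the Whisper-large-v3 encoder output
NUM_CLASSES = 17

# Attention pooling: one score per frame, no bias
pool_params = linear_params(D_MODEL, 1, bias=False)

# MLP head: LayerNorm, then three linear layers (dropout and GELU have no parameters)
head_params = (
    2 * D_MODEL
    + linear_params(D_MODEL, 512)
    + linear_params(512, 256)
    + linear_params(256, NUM_CLASSES)
)

print(f"pooling: {pool_params:,}")
print(f"classifier head: {head_params:,}")
print(f"total: {pool_params + head_params:,}")
```

The pooling layer and head together stay well under one million parameters, so nearly all of the fine-tuned capacity sits in the unfrozen encoder layers.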

Training Details

Parameter	Value
Base model	openai/whisper-large-v3
Dataset	ADI17 (17 Arabic dialect classes)
Audio window	15 seconds, non-overlapping
Frozen layers	First 24 of 32 encoder layers
Optimizer	AdamW (lr=2e-5, weight decay=1e-2)
LR schedule	Cosine with 10% linear warmup
Batch size	32 × 4 gradient accumulation = 128 effective
Precision	AMP float16
Augmentation	SpecAugment (2 frequency + 2 time masks)
Loss	CrossEntropy + label smoothing (0.1) + inverse-frequency class weights
Early stopping	Patience = 4 epochs
Hardware	NVIDIA A100
Training epochs	30
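Two of the settings above are easier to follow as code. The sketch below is a minimal illustration, not the training script: `lr_at` implements a cosine schedule with 10% linear warmup and the stated peak rate of 2e-5, assuming the rate decays to zero, and `inverse_frequency_weights` uses the common normalization total / (num_classes × count), which is an assumption about how the class weights were scaled. The step count and class counts are hypothetical.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (the floor is assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def inverse_frequency_weights(counts: list[int]) -> list[float]:
    """Weight each class by total / (num_classes * count); a balanced set gives all ones."""
    total = sum(counts)
    return [total / (len(counts) * c) for c in counts]

TOTAL_STEPS = 1_000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at(step, TOTAL_STEPS):.2e}")

# Hypothetical class counts: rarer classes receive larger loss weights
print(inverse_frequency_weights([12_000, 6_000, 3_000]))
```

With gradient accumulation, the schedule would normally advance once per optimizer step (every 4 batches here) rather than once per batch.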

Results

Metric	Score
Test Accuracy	92.58%
Macro F1	91.51%
Weighted F1	92.65%

Per-Dialect F1 Scores

Dialect	F1	Dialect	F1	Dialect	F1
ALG	0.9498	KUW	0.8597	QAT	0.9503
EGY	0.8892	LEB	0.9644	SUD	0.9184
IRA	0.9111	LIB	0.8575	SYR	0.8040
JOR	0.9574	MAU	0.8965	UAE	0.9244
KSA	0.9488	MOR	0.9641	YEM	0.8793
		OMA	0.9303	PAL	0.9509

Best dialect: Lebanese (LEB) at 96.44% F1
Hardest dialect: Syrian (SYR) at 80.40% F1, frequently confused with neighboring Levantine dialects
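The macro F1 is the unweighted mean of the per-dialect scores, so it can be checked directly against the table. A short sketch using the values reported above:

```python
PER_DIALECT_F1 = {
    "ALG": 0.9498, "EGY": 0.8892, "IRA": 0.9111, "JOR": 0.9574, "KSA": 0.9488,
    "KUW": 0.8597, "LEB": 0.9644, "LIB": 0.8575, "MAU": 0.8965, "MOR": 0.9641,
    "OMA": 0.9303, "PAL": 0.9509, "QAT": 0.9503, "SUD": 0.9184, "SYR": 0.8040,
    "UAE": 0.9244, "YEM": 0.8793,
}

# Macro F1: every dialect counts equally, regardless of how many test clips it has
macro_f1 = sum(PER_DIALECT_F1.values()) / len(PER_DIALECT_F1)
best = max(PER_DIALECT_F1, key=PER_DIALECT_F1.get)
worst = min(PER_DIALECT_F1, key=PER_DIALECT_F1.get)

print(f"macro F1 = {macro_f1:.4f}")   # 0.9151, matching the reported 91.51%
print(f"best = {best}, hardest = {worst}")
```

Accuracy and weighted F1 cannot be reproduced this way, because they depend on the per-class test counts, which are not listed here.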

Data Preprocessing Pipeline

Raw .wav audio files were preprocessed as follows:

Converted to mono and resampled to 16 kHz
Segmented into non-overlapping 15-second windows (zero-padded if shorter)
Maximum 50 windows per recording to prevent long files from dominating
Each window converted to a 128-channel log-mel spectrogram using WhisperFeatureExtractor
Features saved as float16 in compressed HDF5 files for fast training I/O

The training set was stratified and capped at 12,000 windows per class, yielding approximately 180,000 training windows in total.
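The windowing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original preprocessing script: it assumes the input is already mono 16 kHz audio, that the final partial window is kept and zero-padded, and that the 50-window cap keeps the first 50 windows of a recording. The function name `segment` and its parameters are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 15 * SAMPLE_RATE   # 15-second windows
MAX_WINDOWS = 50                    # cap per recording

def segment(audio: np.ndarray, window_samples: int = WINDOW_SAMPLES,
            max_windows: int = MAX_WINDOWS) -> np.ndarray:
    """Split audio into non-overlapping windows, zero-padding the last one.

    Returns an array of shape (n_windows, window_samples).
    """
    # Ceiling division gives the number of windows needed; always emit at least one
    n_windows = max(1, -(-len(audio) // window_samples))
    n_windows = min(n_windows, max_windows)
    audio = audio[: n_windows * window_samples]   # drop anything beyond the cap
    padded = np.zeros(n_windows * window_samples, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n_windows, window_samples)

# A 40-second clip yields three windows; the last holds 10 s of audio and 5 s of zeros
clip = np.ones(40 * SAMPLE_RATE, dtype=np.float32)
print(segment(clip).shape)
```

Each row of the result would then go through WhisperFeatureExtractor before being written to disk.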

How to Use

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download

# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-15s")

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)

# Define architecture (must match training)
class AttentionPool(nn.Module):
    """Learned weighted average over the encoder's time frames."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)  # one score per frame
    def forward(self, x):
        # x: (batch, frames, d_model); softmax runs over the frame axis
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)  # (batch, d_model)

class WhisperDialectClassifier(nn.Module):
    """Whisper encoder followed by attention pooling and an MLP head."""
    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        # x: log-mel features from WhisperFeatureExtractor; returns unnormalized logits
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))

# Load model (uses a GPU when one is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()

# Inference: load mono 16 kHz audio, then pad or truncate to exactly 15 seconds
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 15*16000 - len(audio))))[:15*16000]

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    logits = model(inputs["input_features"].to(device))
    probs = torch.softmax(logits, dim=-1)[0]

ID2LABEL = {0:"ALG",1:"EGY",2:"IRA",3:"JOR",4:"KSA",5:"KUW",6:"LEB",7:"LIB",
            8:"MAU",9:"MOR",10:"OMA",11:"PAL",12:"QAT",13:"SUD",14:"SYR",15:"UAE",16:"YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
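A single argmax hides how close the runner-up classes are, which matters for dialects the model tends to confuse. Here is a minimal sketch for listing the top candidates, assuming the probabilities have been converted to a plain list (for example with `probs.tolist()`). The helper name `top_k` and the example probabilities are hypothetical.

```python
LABELS = ["ALG", "EGY", "IRA", "JOR", "KSA", "KUW", "LEB", "LIB", "MAU",
          "MOR", "OMA", "PAL", "QAT", "SUD", "SYR", "UAE", "YEM"]

def top_k(probs: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable dialect codes with their probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(LABELS[i], p) for i, p in ranked[:k]]

# Made-up distribution in which Syrian narrowly beats two other Levantine dialects
example = [0.0] * len(LABELS)
example[LABELS.index("SYR")] = 0.5
example[LABELS.index("LEB")] = 0.3
example[LABELS.index("JOR")] = 0.2

for code, p in top_k(example):
    print(f"{code}: {p * 100:.1f}%")
```

A small gap between the first and second candidates is a reasonable signal to treat the prediction as uncertain.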

Limitations

Performance is lower on the Syrian (SYR) dialect because of its acoustic similarity to other Levantine dialects (Lebanese, Palestinian, Jordanian)
The model was trained on broadcast and YouTube speech, so performance may degrade on conversational, noisy, or phone-quality audio
Audio shorter than 3 seconds is not supported
Only the encoder of Whisper-large-v3 is used; the decoder (transcription) is never run
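The 3-second minimum can be enforced before inference so that very short clips fail loudly instead of producing an unreliable prediction. A minimal sketch follows; the function name `check_duration` and the error message are hypothetical and not part of the released code.

```python
MIN_SECONDS = 3.0
SAMPLE_RATE = 16_000

def check_duration(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Return the clip duration in seconds, or raise if it is below the minimum."""
    duration = num_samples / sample_rate
    if duration < MIN_SECONDS:
        raise ValueError(
            f"clip is {duration:.2f} s long; at least {MIN_SECONDS:.0f} s is required"
        )
    return duration

print(check_duration(5 * SAMPLE_RATE))   # a 5-second clip passes the check
```

With the usage code above, this would be called as `check_duration(len(audio))` right after loading the file and before padding.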

Citation

If you use this model in your research, please cite:

@misc{mesoj8_arabic_dialect_2026,
  author    = {Meso},
  title     = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 17 classes)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-15s}
}
