Whisper-large-v3 Arabic Dialect Identification (15s)

Model Description

This model identifies 17 regional Arabic dialects from short speech recordings. It is built on the encoder of OpenAI's Whisper-large-v3, fine-tuned on the ADI17 dataset, a large-scale Arabic dialect corpus sourced from YouTube broadcasts covering 17 Arab countries.

The system takes up to 15 seconds of raw audio, converts it into a 128-channel log-mel spectrogram (Whisper-large-v3 uses 128 mel bins, not the 80 of earlier Whisper models), passes it through the fine-tuned Whisper encoder, and produces a probability distribution over the 17 dialect classes using a custom attention pooling mechanism and a multi-layer classification head.


Supported Dialects

Code  Country        Code  Country      Code  Country
ALG   Algeria        KUW   Kuwait       QAT   Qatar
EGY   Egypt          LEB   Lebanon      SUD   Sudan
IRA   Iraq           LIB   Libya        SYR   Syria
JOR   Jordan         MAU   Mauritania   UAE   United Arab Emirates
KSA   Saudi Arabia   MOR   Morocco      YEM   Yemen
OMA   Oman           PAL   Palestine

Architecture

Input Audio (up to 15s)
        ↓
WhisperFeatureExtractor → 128-channel log-mel spectrogram (128 × 3000)
        ↓
Whisper-large-v3 Encoder (32 Transformer layers)
  ├── Layers 1–24 : Frozen (low-level acoustic features preserved)
  └── Layers 25–32: Fine-tuned (high-level dialect-specific features)
        ↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
        ↓
MLP Classifier:
  LayerNorm → Dropout(0.15)
  → Linear(1280 → 512) → GELU
  → Dropout(0.15)
  → Linear(512 → 256) → GELU
  → Dropout(0.075)
  → Linear(256 → 17) → dialect probabilities
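
The attention pooling step in the diagram can be illustrated with a small NumPy sketch: one score per time frame, a softmax over the time axis, then a weighted average of the frames. The function name attention_pool is hypothetical, and the scoring vector is random rather than the trained pool.pth weights, so only the shapes and the mechanism are meaningful here.

```python
import numpy as np

def attention_pool(hidden, attn_vector):
    """Collapse (frames, d_model) encoder states into one d_model vector.

    hidden:      array of shape (frames, d_model), e.g. (1500, 1280)
    attn_vector: array of shape (d_model,), the scoring weights
    """
    scores = hidden @ attn_vector        # one scalar score per frame
    scores = scores - scores.max()       # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()    # softmax over the time axis
    pooled = weights @ hidden            # weighted average of the frames
    return pooled, weights

rng = np.random.default_rng(0)
# Stand-in for the encoder output of one clip (assumed shape 1500 x 1280).
hidden = rng.standard_normal((1500, 1280)).astype(np.float32)
# Random scoring vector for illustration only, NOT the trained weights.
attn_vector = (rng.standard_normal(1280) * 0.01).astype(np.float32)

pooled, weights = attention_pool(hidden, attn_vector)
print(pooled.shape, weights.shape)
```

With an all-zero scoring vector every frame receives the same weight, so the pooling reduces to plain mean pooling; the learned vector lets the model emphasise some frames over others.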

Training Details

Parameter        Value
Base model       openai/whisper-large-v3
Dataset          ADI17 (17 Arabic dialect classes)
Audio window     15 seconds, non-overlapping
Frozen layers    First 24 of 32 encoder layers
Optimizer        AdamW (lr=2e-5, weight decay=1e-2)
LR schedule      Cosine with 10% linear warmup
Batch size       32 × 4 gradient accumulation = 128 effective
Precision        AMP float16
Augmentation     SpecAugment (2 frequency + 2 time masks)
Loss             CrossEntropy + label smoothing (0.1) + inverse-frequency class weights
Early stopping   Patience = 4 epochs
Hardware         NVIDIA A100
Training epochs  30
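
Two of the settings above, the warmup-plus-cosine learning-rate schedule and the inverse-frequency class weights, can be sketched in plain Python. This is a minimal illustration, not the training script: the function names are hypothetical, decaying to a floor of zero is an assumption, and the N / (K * n_c) normalisation of the class weights is one common choice that the card does not confirm.

```python
import math
from collections import Counter

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to base_lr, then cosine decay (floor of zero is assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def inverse_frequency_weights(labels, num_classes):
    """Weight class c by N / (K * n_c); this normalisation is an assumption."""
    counts = Counter(labels)
    total = len(labels)
    # max(..., 1) avoids division by zero for a class with no samples.
    return [total / (num_classes * max(counts[c], 1)) for c in range(num_classes)]

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))

# Toy label list: class 0 is common, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
print(inverse_frequency_weights(toy_labels, num_classes=3))
```

In PyTorch, weights like these would typically be passed to torch.nn.CrossEntropyLoss through its weight argument, together with label_smoothing=0.1.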

Results

Metric         Score
Test Accuracy  92.58%
Macro F1       91.51%
Weighted F1    92.65%

Per-Dialect F1 Scores

Dialect  F1      Dialect  F1      Dialect  F1
ALG      0.9498  KUW      0.8597  QAT      0.9503
EGY      0.8892  LEB      0.9644  SUD      0.9184
IRA      0.9111  LIB      0.8575  SYR      0.8040
JOR      0.9574  MAU      0.8965  UAE      0.9244
KSA      0.9488  MOR      0.9641  YEM      0.8793
OMA      0.9303  PAL      0.9509

Best dialect: Lebanese (LEB) at 96.44% F1
Hardest dialect: Syrian (SYR) at 80.40% F1, frequently confused with neighboring Levantine dialects
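
As a sanity check, the reported Macro F1 can be reproduced as the unweighted mean of the per-dialect scores listed above. The snippet below only re-uses the published numbers; the variable names are illustrative.

```python
# Per-dialect F1 scores copied from the table above.
per_dialect_f1 = {
    "ALG": 0.9498, "EGY": 0.8892, "IRA": 0.9111, "JOR": 0.9574,
    "KSA": 0.9488, "KUW": 0.8597, "LEB": 0.9644, "LIB": 0.8575,
    "MAU": 0.8965, "MOR": 0.9641, "OMA": 0.9303, "PAL": 0.9509,
    "QAT": 0.9503, "SUD": 0.9184, "SYR": 0.8040, "UAE": 0.9244,
    "YEM": 0.8793,
}

# Macro F1 gives every dialect equal weight, regardless of test-set size.
macro_f1 = sum(per_dialect_f1.values()) / len(per_dialect_f1)
best = max(per_dialect_f1, key=per_dialect_f1.get)
hardest = min(per_dialect_f1, key=per_dialect_f1.get)

print(f"Macro F1: {macro_f1 * 100:.2f}%")  # Macro F1: 91.51%
print(f"Best: {best}, hardest: {hardest}")  # Best: LEB, hardest: SYR
```

Weighted F1 cannot be recomputed the same way, because it needs the number of test samples per dialect, which is not listed here.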


Data Preprocessing Pipeline

Raw .wav audio files were preprocessed as follows:

  1. Converted to mono and resampled to 16 kHz
  2. Segmented into non-overlapping 15-second windows (zero-padded if shorter)
  3. Maximum 50 windows per recording to prevent long files from dominating
  4. Each window converted to a 128-channel log-mel spectrogram using WhisperFeatureExtractor
  5. Features saved as float16 in compressed HDF5 files for fast training I/O
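
Steps 2 and 3 can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name split_into_windows is hypothetical, the trailing partial window is assumed to be zero-padded and kept, and the 50-window cap is assumed to keep the first 50 windows (the original pipeline may select them differently).

```python
import numpy as np

SAMPLE_RATE = 16_000      # Hz, after resampling (step 1)
WINDOW_SECONDS = 15       # window length (step 2)
MAX_WINDOWS = 50          # per-recording cap (step 3)

def split_into_windows(audio, sample_rate=SAMPLE_RATE,
                       window_seconds=WINDOW_SECONDS, max_windows=MAX_WINDOWS):
    """Return an array of shape (num_windows, window_len) of non-overlapping windows."""
    window_len = sample_rate * window_seconds
    # Ceiling division; a recording shorter than one window still yields one window.
    num_windows = max(1, -(-len(audio) // window_len))
    # Assumption: the cap keeps the FIRST max_windows windows.
    num_windows = min(num_windows, max_windows)
    audio = audio[: num_windows * window_len]
    # Zero-pad the tail so the last window is full length.
    padded = np.pad(audio, (0, num_windows * window_len - len(audio)))
    return padded.reshape(num_windows, window_len)

audio = np.zeros(40 * SAMPLE_RATE, dtype=np.float32)  # stand-in for a 40 s mono recording
windows = split_into_windows(audio)
print(windows.shape)  # (3, 240000)
```

Each row could then be passed to WhisperFeatureExtractor with sampling_rate=16000, as in step 4.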

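The training set was stratified and capped at 12,000 windows per class, yielding approximately 180,000 training windows in total. This is below the 17 × 12,000 = 204,000 ceiling, which suggests that some dialects contributed fewer than 12,000 windows.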


How to Use

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download

# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-15s")

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)

# Define architecture (must match training)
class AttentionPool(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)
    def forward(self, x):
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)

class WhisperDialectClassifier(nn.Module):
    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))

# Load model
device = "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()

# Inference
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = np.pad(audio, (0, max(0, 15*16000 - len(audio))))[:15*16000]

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    probs = torch.softmax(model(inputs["input_features"].to(device)), dim=-1)[0]

ID2LABEL = {0:"ALG",1:"EGY",2:"IRA",3:"JOR",4:"KSA",5:"KUW",6:"LEB",7:"LIB",
            8:"MAU",9:"MOR",10:"OMA",11:"PAL",12:"QAT",13:"SUD",14:"SYR",15:"UAE",16:"YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
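
Because some dialects are easily confused, it can help to look at the few most probable classes rather than only the top one. The helper below is a hypothetical sketch in plain Python: top_k_dialects is not part of the released code, and the example distribution is made up for illustration. In practice the input would be probs.tolist() from the snippet above.

```python
# Label order assumed to match ID2LABEL above.
LABELS = ["ALG", "EGY", "IRA", "JOR", "KSA", "KUW", "LEB", "LIB", "MAU",
          "MOR", "OMA", "PAL", "QAT", "SUD", "SYR", "UAE", "YEM"]

def top_k_dialects(probabilities, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(zip(LABELS, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up distribution for illustration only (not real model output).
example = [0.01] * 17
example[LABELS.index("SYR")] = 0.45
example[LABELS.index("LEB")] = 0.30
example[LABELS.index("PAL")] = 0.11

for label, p in top_k_dialects(example):
    print(f"{label}: {p * 100:.1f}%")
```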

Limitations

  • Performance is lower on Syrian (SYR) dialect due to its acoustic similarity to other Levantine dialects (Lebanese, Palestinian, Jordanian)
  • Model was trained on broadcast/YouTube speech; performance may degrade on conversational, noisy, or phone-quality audio
  • Audio shorter than 3 seconds is not supported
  • Only the encoder of Whisper-large-v3 is used; the decoder (transcription) is not activated

Citation

If you use this model in your research, please cite:

@misc{mesoj8_arabic_dialect_2026,
  author    = {Meso},
  title     = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 17 classes)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-15s}
}