Whisper-large-v3 Arabic Dialect Identification (30s)

Model Description

This model identifies 17 regional Arabic dialects from speech recordings of up to 30 seconds. It is built on top of OpenAI's Whisper-large-v3 encoder, fine-tuned on the ADI17 dataset, a large-scale Arabic dialect corpus sourced from YouTube broadcasts covering 17 Arab countries.

The system takes up to 30 seconds of raw audio, converts it into an 80-channel log-mel spectrogram, passes it through the fine-tuned Whisper encoder, and produces a probability distribution over 17 dialect classes using a custom attention pooling mechanism and a multi-layer classification head.

Compared to the 15s version, this model uses longer audio windows during training, giving the encoder more speech context per sample and achieving higher accuracy (95.11% vs 92.58%).


Supported Dialects

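| Code | Country | Code | Country | Code | Country |
|------|---------|------|---------|------|---------|
| ALG | Algeria | KUW | Kuwait | QAT | Qatar |
| EGY | Egypt | LEB | Lebanon | SUD | Sudan |
| IRA | Iraq | LIB | Libya | SYR | Syria |
| JOR | Jordan | MAU | Mauritania | UAE | UAE |
| KSA | Saudi Arabia | MOR | Morocco | YEM | Yemen |
| OMA | Oman | PAL | Palestine | | |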

Architecture

Input Audio (up to 30s)
        ↓
WhisperFeatureExtractor → 80-channel log-mel spectrogram (80 × 3000)
        ↓
Whisper-large-v3 Encoder (32 Transformer layers)
  ├── Layers 1–24 : Frozen (low-level acoustic features preserved)
  └── Layers 25–32: Fine-tuned (high-level dialect-specific features)
        ↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
        ↓
MLP Classifier Head:
  LayerNorm → Dropout(0.15)
  → Linear(1280 → 512) → GELU
  → Dropout(0.15)
  → Linear(512 → 256) → GELU
  → Dropout(0.075)
  → Linear(256 → 17) → dialect probabilities

Training Details

| Parameter | Value |
|-----------|-------|
| Base model | openai/whisper-large-v3 |
| Dataset | ADI17 (17 Arabic dialect classes) |
| Audio window | 30 seconds, non-overlapping |
| Frozen layers | First 24 of 32 encoder layers |
| Optimizer | AdamW (lr=2e-5, weight decay=1e-2) |
| LR schedule | Cosine with 10% linear warmup |
| Batch size | 32 × 4 gradient accumulation = 128 effective |
| Precision | AMP float16 |
| Augmentation | SpecAugment (2 frequency + 2 time masks) |
| Loss | CrossEntropy + label smoothing (0.1) + inverse-frequency class weights |
| Max per class | 8,000 training samples |
| Early stopping | Patience = 4 epochs (stopped at epoch 15) |
| Hardware | NVIDIA A100 |
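
To make the learning-rate schedule concrete, here is a minimal sketch of a cosine schedule with 10% linear warmup and a peak of 2e-5. The card does not state the decay floor or the total number of steps, so the decay to zero and the `total` value below are assumptions, and `lr_at_step` is an illustrative name rather than the training script's function.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (floor is assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1_000  # hypothetical number of optimizer steps
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

Steps here are optimizer steps. With 4-step gradient accumulation at batch size 32, each optimizer step consumes 128 samples.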

Results

| Metric | Score |
|--------|-------|
| Test Accuracy | 95.11% |
| Macro F1 | 94.16% |
| Weighted F1 | 95.14% |

Per-Dialect F1 Scores

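| Dialect | F1 | Dialect | F1 | Dialect | F1 |
|---------|----|---------|----|---------|----|
| ALG | 95.73% | KUW | 90.94% | QAT | 97.32% |
| EGY | 93.58% | LEB | 97.81% | SUD | 93.72% |
| IRA | 94.12% | LIB | 87.16% | SYR | 88.78% |
| JOR | 96.62% | MAU | 90.96% | UAE | 95.33% |
| KSA | 96.22% | MOR | 97.54% | YEM | 92.16% |
| PAL | 96.91% | OMA | 95.82% | | |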

Best dialect: Lebanese (LEB) at 97.81% F1
Hardest dialect: Libyan (LIB) at 87.16% F1
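
Macro F1 is the unweighted mean of the per-class F1 scores, so the reported figure can be checked directly from the per-dialect scores above. The short sketch below does that and recovers the best and hardest dialects. Weighted F1 is not reproduced because it needs per-class test counts, which this card does not list.

```python
# Per-dialect F1 scores (%) copied from the table above.
per_dialect_f1 = {
    "ALG": 95.73, "EGY": 93.58, "IRA": 94.12, "JOR": 96.62, "KSA": 96.22,
    "KUW": 90.94, "LEB": 97.81, "LIB": 87.16, "MAU": 90.96, "MOR": 97.54,
    "OMA": 95.82, "PAL": 96.91, "QAT": 97.32, "SUD": 93.72, "SYR": 88.78,
    "UAE": 95.33, "YEM": 92.16,
}

# Macro F1: every dialect counts equally, regardless of test-set size.
macro_f1 = sum(per_dialect_f1.values()) / len(per_dialect_f1)
best = max(per_dialect_f1, key=per_dialect_f1.get)
hardest = min(per_dialect_f1, key=per_dialect_f1.get)

print(f"Macro F1: {macro_f1:.2f}%")  # Macro F1: 94.16%
print(f"Best: {best}, hardest: {hardest}")  # Best: LEB, hardest: LIB
```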

Comparison with 15s Model

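| Metric | 15s Model | 30s Model | Improvement (percentage points) |
|--------|-----------|-----------|---------------------------------|
| Test Accuracy | 92.58% | 95.11% | +2.53 |
| Macro F1 | 91.51% | 94.16% | +2.65 |
| Weighted F1 | 92.65% | 95.14% | +2.49 |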

The 30s model outperforms the 15s model on every aggregate metric above and across all dialects, most likely because longer audio windows provide more phonetic, prosodic, and lexical context for dialect identification.


Data Preprocessing Pipeline

Raw .wav audio files were preprocessed as follows:

  1. Converted to mono and resampled to 16 kHz
  2. Segmented into non-overlapping 30-second windows (zero-padded if shorter)
  3. Maximum 10 windows per recording to prevent long files from dominating
  4. Each window converted to an 80-channel log-mel spectrogram using WhisperFeatureExtractor
  5. Features saved as float16 in compressed HDF5 files for fast training I/O
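
Steps 2 and 3 can be sketched in plain Python as follows. This is an illustration, not the original preprocessing script: it assumes the trailing partial window is kept and zero-padded, and the `segment` helper and the 70-second demo recording are hypothetical. With NumPy arrays the same logic applies using slicing and `np.pad`.

```python
def segment(samples, sample_rate=16_000, window_seconds=30, max_windows=10):
    """Split mono samples into non-overlapping, zero-padded fixed-length windows."""
    win = window_seconds * sample_rate
    windows = []
    for start in range(0, len(samples), win):
        if len(windows) == max_windows:
            break  # cap windows so long recordings do not dominate
        chunk = list(samples[start:start + win])
        chunk += [0.0] * (win - len(chunk))  # zero-pad a short trailing chunk
        windows.append(chunk)
    return windows

demo = [0.1] * (70 * 16_000)  # hypothetical 70-second recording
windows = segment(demo)
print(len(windows), len(windows[0]))  # 3 480000
```

The third window in this demo holds 10 seconds of audio followed by 20 seconds of zeros.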

Training set was stratified and capped at 8,000 samples per class.
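
A per-class cap limits only the largest classes, so smaller classes stay under-represented, which is presumably why the loss also uses inverse-frequency class weights. The sketch below shows one way to do both. The random subsampling, the `n / (k * count)` weight normalization, the helper names, and the class counts are all assumptions made for illustration, since the card does not specify them.

```python
import random
from collections import Counter, defaultdict

def cap_per_class(labels, cap=8_000, seed=0):
    """Return sorted indices keeping at most `cap` randomly chosen samples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    kept = []
    for idxs in by_class.values():
        kept.extend(idxs if len(idxs) <= cap else rng.sample(idxs, cap))
    return sorted(kept)

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

# Hypothetical class counts, not the real ADI17 distribution.
labels = ["EGY"] * 12_000 + ["LIB"] * 3_000 + ["QAT"] * 8_000
kept = cap_per_class(labels)
capped_labels = [labels[i] for i in kept]
weights = inverse_frequency_weights(capped_labels)
print(Counter(capped_labels))
print(weights)
```

In PyTorch, weights like these would typically be passed as the `weight` argument of `nn.CrossEntropyLoss`, alongside `label_smoothing=0.1`.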


How to Use

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download

# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-30s")

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)

# Define architecture (must match training)
class AttentionPool(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)
    def forward(self, x):
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)

class WhisperDialectClassifier(nn.Module):
    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))

# Load model
device = "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()

# Inference
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
target = 30 * 16000  # 480000 samples
audio = np.pad(audio, (0, max(0, target - len(audio))))[:target]

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
features = inputs["input_features"].to(device)  # keep inputs on the same device as the model
with torch.inference_mode():
    probs = torch.softmax(model(features), dim=-1)[0]

ID2LABEL = {0:"ALG",1:"EGY",2:"IRA",3:"JOR",4:"KSA",5:"KUW",6:"LEB",7:"LIB",
            8:"MAU",9:"MOR",10:"OMA",11:"PAL",12:"QAT",13:"SUD",14:"SYR",15:"UAE",16:"YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
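
Since some dialects are easily confused with their neighbors, it can help to look at the runners-up as well as the top prediction. The helper below is a small sketch for that. `top_k_dialects` is an illustrative name and the example distribution is made up; with the code above you would pass `probs.tolist()` instead.

```python
ID2LABEL = {0: "ALG", 1: "EGY", 2: "IRA", 3: "JOR", 4: "KSA", 5: "KUW",
            6: "LEB", 7: "LIB", 8: "MAU", 9: "MOR", 10: "OMA", 11: "PAL",
            12: "QAT", 13: "SUD", 14: "SYR", 15: "UAE", 16: "YEM"}

def top_k_dialects(probs, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(ID2LABEL[i], p) for i, p in ranked[:k]]

# Hypothetical probability vector over the 17 classes, for illustration only.
example = [0.01] * 17
example[7], example[14], example[1] = 0.50, 0.30, 0.06

for label, p in top_k_dialects(example):
    print(f"{label}: {p * 100:.1f}%")
```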

Limitations

  • Libyan (LIB) and Syrian (SYR) dialect performance is lower due to acoustic overlap with neighboring dialects
  • The model was trained on broadcast and YouTube speech; performance may degrade on conversational, noisy, or phone-quality audio
  • Code-switching (mixing Arabic with other languages) may reduce accuracy
  • Audio shorter than 3 seconds is not supported
  • Only the Whisper encoder is used; the decoder (transcription) is not activated
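
The usage code above zero-pads any clip to 30 seconds without checking its length, so a short clip would still produce a prediction. A simple guard like the sketch below can reject unsupported input before inference. `check_duration` is an illustrative helper, not part of the released code.

```python
MIN_SECONDS = 3.0  # minimum supported clip length stated in the limitations

def check_duration(num_samples: int, sample_rate: int = 16_000) -> float:
    """Return the clip duration in seconds, raising ValueError if it is under 3 s."""
    duration = num_samples / sample_rate
    if duration < MIN_SECONDS:
        raise ValueError(
            f"Clip is {duration:.2f} s long; at least {MIN_SECONDS:.0f} s is required."
        )
    return duration

print(check_duration(480_000))  # 30.0
```

With the usage code above, this would be called as `check_duration(len(audio))` right after `librosa.load`. Audio beyond 30 seconds is not rejected, but that code truncates it to the first 30 seconds.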

Related Models

| Model | Window | Accuracy | Repo |
|-------|--------|----------|------|
| Whisper-large-v3 ADI17 15s | 15s | 92.58% | mesoj8/whisper-large-v3-arabic-dialect-15s |
| Whisper-large-v3 ADI17 30s | 30s | 95.11% | This model |

Citation

@misc{mesoj8_arabic_dialect_30s_2026,
  author    = {Meso},
  title     = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 30s windows)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-30s}
}