Whisper-large-v3 Arabic Dialect Identification (30s)

Model Description

This model identifies 17 regional Arabic dialects from speech recordings of up to 30 seconds. It is built on top of OpenAI's Whisper-large-v3 encoder, fine-tuned on the ADI17 dataset — a large-scale Arabic dialect corpus sourced from YouTube broadcasts covering 17 Arab countries.

The system takes up to 30 seconds of raw audio, converts it into a 128-channel log-mel spectrogram (the input format of Whisper-large-v3, which uses 128 mel bins rather than the 80 of earlier Whisper models), passes it through the fine-tuned Whisper encoder, and produces a probability distribution over the 17 dialect classes using an attention pooling layer and a multi-layer classification head.
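The frame counts that appear throughout this card follow from a few fixed constants. A minimal sketch of that arithmetic, assuming the standard Whisper front end (16 kHz audio, a 160-sample hop between spectrogram frames, and a stride-2 convolution in the encoder stem); the constant names are illustrative and not taken from the training code:

```python
SAMPLE_RATE = 16_000     # Hz, the rate Whisper expects
WINDOW_SECONDS = 30      # fixed input length
HOP_LENGTH = 160         # samples between spectrogram frames (10 ms)
CONV_STRIDE = 2          # the encoder's second convolution halves the time axis

n_samples = SAMPLE_RATE * WINDOW_SECONDS          # raw samples per window
n_mel_frames = n_samples // HOP_LENGTH            # spectrogram frames fed to the encoder
n_encoder_frames = n_mel_frames // CONV_STRIDE    # hidden states seen by the pooling layer

print(n_samples, n_mel_frames, n_encoder_frames)  # 480000 3000 1500
```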

Compared to the 15s version, this model uses longer audio windows during training, giving the encoder more speech context per sample and achieving higher accuracy (95.11% vs 92.58%).

Supported Dialects

Code	Country	Code	Country	Code	Country
ALG	Algeria	KUW	Kuwait	QAT	Qatar
EGY	Egypt	LEB	Lebanon	SUD	Sudan
IRA	Iraq	LIB	Libya	SYR	Syria
JOR	Jordan	MAU	Mauritania	UAE	UAE
KSA	Saudi Arabia	MOR	Morocco	YEM	Yemen
		OMA	Oman	PAL	Palestine

Architecture

Input Audio (up to 30s)
        ↓
WhisperFeatureExtractor → 128-channel log-mel spectrogram (128 × 3000)
        ↓
Whisper-large-v3 Encoder (32 Transformer layers)
  ├── Layers 1–24 : Frozen (low-level acoustic features preserved)
  └── Layers 25–32: Fine-tuned (high-level dialect-specific features)
        ↓
Attention Pooling → learned weighted average over 1500 time frames → 1280-dim vector
        ↓
MLP Classifier Head:
  LayerNorm → Dropout(0.15)
  → Linear(1280 → 512) → GELU
  → Dropout(0.15)
  → Linear(512 → 256) → GELU
  → Dropout(0.075)
  → Linear(256 → 17) → dialect probabilities
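To make the attention pooling step concrete, here is a pure-Python toy version of the same idea: score each time frame with a bias-free linear layer, apply a softmax over time, and take the weighted sum of the frames. It is a sketch for illustration only; the function name and the tiny 2-dimensional frames are hypothetical, and the real model operates on 1500 frames of 1280 dimensions with PyTorch tensors.

```python
import math

def attention_pool(frames, w):
    """Pool a list of frame vectors into one vector.

    frames: list of T vectors, each of dimension D
    w:      scoring vector of dimension D (a Linear(D, 1) layer without bias)
    """
    # One scalar score per frame: the dot product w . x_t
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in frames]
    # Numerically stable softmax over the time axis
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the frames, one output value per feature dimension
    dim = len(frames[0])
    pooled = [sum(a * x[d] for a, x in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Toy input: three frames of dimension 2
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, w=[2.0, 0.0])
print(weights)  # frames with a larger first feature receive more weight
print(pooled)
```

With a zero scoring vector every frame gets the same weight, so the layer reduces to plain mean pooling; training moves it away from that baseline toward frames that are more informative for the dialect decision.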

Training Details

Parameter	Value
Base model	openai/whisper-large-v3
Dataset	ADI17 (17 Arabic dialect classes)
Audio window	30 seconds, non-overlapping
Frozen layers	First 24 of 32 encoder layers
Optimizer	AdamW (lr=2e-5, weight decay=1e-2)
LR schedule	Cosine with 10% linear warmup
Batch size	32 × 4 gradient accumulation = 128 effective
Precision	AMP float16
Augmentation	SpecAugment (2 frequency + 2 time masks)
Loss	CrossEntropy + label smoothing (0.1) + inverse-frequency class weights
Max per class	8,000 training samples
Early stopping	Patience = 4 epochs (stopped at epoch 15)
Hardware	NVIDIA A100
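The learning-rate schedule in the table (cosine with 10% linear warmup, peak 2e-5) can be sketched as a function of the optimizer step. This assumes the rate decays to zero at the final step and that the schedule is evaluated per step; the card does not state either detail, and `lr_at` and `total_steps` are illustrative names.

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a given optimizer step (assumed decay-to-zero cosine)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 1000 optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, 1000):.2e}")
```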

Results

Metric	Score
Test Accuracy	95.11%
Macro F1	94.16%
Weighted F1	95.14%

Per-Dialect F1 Scores

Dialect	F1	Dialect	F1	Dialect	F1
ALG	95.73%	KUW	90.94%	QAT	97.32%
EGY	93.58%	LEB	97.81%	SUD	93.72%
IRA	94.12%	LIB	87.16%	SYR	88.78%
JOR	96.62%	MAU	90.96%	UAE	95.33%
KSA	96.22%	MOR	97.54%	YEM	92.16%
PAL	96.91%	OMA	95.82%

Best dialect: Lebanese (LEB) at 97.81% F1
Hardest dialect: Libyan (LIB) at 87.16% F1
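The per-dialect scores can be cross-checked against the reported Macro F1, since macro averaging is the unweighted mean of the per-class F1 values. A short sketch using only the numbers from the table above:

```python
per_dialect_f1 = {
    "ALG": 95.73, "EGY": 93.58, "IRA": 94.12, "JOR": 96.62, "KSA": 96.22,
    "KUW": 90.94, "LEB": 97.81, "LIB": 87.16, "MAU": 90.96, "MOR": 97.54,
    "OMA": 95.82, "PAL": 96.91, "QAT": 97.32, "SUD": 93.72, "SYR": 88.78,
    "UAE": 95.33, "YEM": 92.16,
}

# Macro F1: every dialect counts equally, regardless of how many test samples it has
macro_f1 = sum(per_dialect_f1.values()) / len(per_dialect_f1)
best = max(per_dialect_f1, key=per_dialect_f1.get)
hardest = min(per_dialect_f1, key=per_dialect_f1.get)

print(f"macro F1 = {macro_f1:.2f}%, best = {best}, hardest = {hardest}")
# macro F1 = 94.16%, best = LEB, hardest = LIB
```

Weighted F1 cannot be reproduced the same way because it needs the per-dialect test set sizes, which are not listed here.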

Comparison with 15s Model

Metric	15s Model	30s Model	Improvement (percentage points)
Test Accuracy	92.58%	95.11%	+2.53
Macro F1	91.51%	94.16%	+2.65
Weighted F1	92.65%	95.14%	+2.49

The 30s model consistently outperforms the 15s model across all dialects, as longer audio windows provide more phonetic, prosodic, and lexical context for dialect identification.
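The improvements in the table are absolute differences. Expressed relative to the 15s model's error rate, the accuracy gain corresponds to roughly a third fewer misclassified test samples. A small sketch of that arithmetic, using only the two accuracy figures from the table:

```python
acc_15s, acc_30s = 92.58, 95.11

gain_pp = acc_30s - acc_15s            # absolute gain in percentage points
err_15s = 100 - acc_15s                # error rate of the 15s model
err_30s = 100 - acc_30s                # error rate of the 30s model
rel_err_reduction = (err_15s - err_30s) / err_15s

print(f"+{gain_pp:.2f} pp, error {err_15s:.2f}% -> {err_30s:.2f}%, "
      f"{rel_err_reduction:.1%} fewer errors")
```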

Data Preprocessing Pipeline

Raw .wav audio files were preprocessed as follows:

Converted to mono and resampled to 16 kHz
Segmented into non-overlapping 30-second windows (zero-padded if shorter)
Maximum 10 windows per recording to prevent long files from dominating
Each window converted to a 128-channel log-mel spectrogram using WhisperFeatureExtractor
Features saved as float16 in compressed HDF5 files for fast training I/O
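The windowing rule in the steps above can be sketched as a function that plans window boundaries for one recording. Two details are assumptions, since the card does not spell them out: a trailing partial window is kept and zero-padded, and when a file yields more than 10 windows the first 10 are used. The function name `plan_windows` is hypothetical.

```python
SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE   # 480000 samples per window
MAX_WINDOWS = 10                    # cap per recording

def plan_windows(n_samples, window=WINDOW_SAMPLES, max_windows=MAX_WINDOWS):
    """Return a list of (start, stop, n_pad) tuples for one recording.

    start/stop index into the audio array; n_pad is the number of zeros
    to append so that the window reaches the full length.
    """
    windows = []
    start = 0
    while start < n_samples and len(windows) < max_windows:
        stop = min(start + window, n_samples)
        windows.append((start, stop, window - (stop - start)))
        start += window   # non-overlapping: the next window starts where this one ends
    return windows

# A 45-second recording gives one full window and one half-padded window
print(plan_windows(45 * SAMPLE_RATE))
# [(0, 480000, 0), (480000, 720000, 240000)]
```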

Training set was stratified and capped at 8,000 samples per class.
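A minimal sketch of the per-class cap, together with the inverse-frequency class weights mentioned under Training Details. Random subsampling with a fixed seed and the `n / (k * count)` normalization are assumptions (the card does not describe how samples were selected or how the weights were scaled), and both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def cap_per_class(labels, max_per_class=8000, seed=0):
    """Return sorted sample indices, keeping at most max_per_class per label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    kept = []
    for indices in by_label.values():
        if len(indices) > max_per_class:
            # Assumed strategy: uniform random subsample of the oversized class
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Toy example with a cap of 4 instead of 8,000
labels = ["EGY"] * 10 + ["LIB"] * 2
kept = cap_per_class(labels, max_per_class=4)
capped_labels = [labels[i] for i in kept]
print(Counter(capped_labels))
print(inverse_frequency_weights(capped_labels))
```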

How to Use

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import librosa
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import snapshot_download

# Download model
model_dir = snapshot_download("mesoj8/whisper-large-v3-arabic-dialect-30s")

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)

# Define architecture (must match training)
class AttentionPool(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, 1, bias=False)
    def forward(self, x):
        weights = F.softmax(self.attn(x), dim=1)
        return (x * weights).sum(dim=1)

class WhisperDialectClassifier(nn.Module):
    def __init__(self, whisper_model, num_classes=17, dropout=0.15):
        super().__init__()
        self.whisper = whisper_model
        hs = self.whisper.config.d_model
        self.pool = AttentionPool(hs)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hs), nn.Dropout(dropout),
            nn.Linear(hs, 512), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.GELU(), nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        h = self.whisper.encoder(input_features=x).last_hidden_state
        return self.classifier(self.pool(h))

# Load model
device = "cpu"
whisper_base = WhisperModel.from_pretrained(model_dir)
model = WhisperDialectClassifier(whisper_base).to(device)
model.pool.load_state_dict(torch.load(f"{model_dir}/pool.pth", map_location=device, weights_only=True))
model.classifier.load_state_dict(torch.load(f"{model_dir}/classifier.pth", map_location=device, weights_only=True))
model.eval()

# Inference
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
target = 30 * 16000  # 480000 samples
audio = np.pad(audio, (0, max(0, target - len(audio))))[:target]

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    # Move the features to the same device as the model before the forward pass
    probs = torch.softmax(model(inputs["input_features"].to(device)), dim=-1)[0]

ID2LABEL = {0:"ALG",1:"EGY",2:"IRA",3:"JOR",4:"KSA",5:"KUW",6:"LEB",7:"LIB",
            8:"MAU",9:"MOR",10:"OMA",11:"PAL",12:"QAT",13:"SUD",14:"SYR",15:"UAE",16:"YEM"}
print(f"Predicted dialect: {ID2LABEL[probs.argmax().item()]} ({probs.max().item()*100:.1f}%)")
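Because several dialects are acoustically close, it can help to inspect the runner-up predictions as well as the top one. A small helper sketch, assuming the probabilities are passed as a plain list (for example `probs.tolist()` from the snippet above); `top_k` is a hypothetical name and the probabilities below are made up for illustration.

```python
ID2LABEL = {0: "ALG", 1: "EGY", 2: "IRA", 3: "JOR", 4: "KSA", 5: "KUW", 6: "LEB",
            7: "LIB", 8: "MAU", 9: "MOR", 10: "OMA", 11: "PAL", 12: "QAT",
            13: "SUD", 14: "SYR", 15: "UAE", 16: "YEM"}

def top_k(probs, id2label, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:k]]

# Made-up distribution over the 17 classes, for demonstration only
example_probs = [0.01] * 17
example_probs[1] = 0.60    # EGY
example_probs[11] = 0.15   # PAL
example_probs[3] = 0.11    # JOR

for label, p in top_k(example_probs, ID2LABEL):
    print(f"{label}: {p * 100:.1f}%")
```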

Limitations

Libyan (LIB) and Syrian (SYR) have the lowest F1 scores (87.16% and 88.78%), likely due to acoustic overlap with neighboring dialects
The model was trained on broadcast and YouTube speech — performance may degrade on conversational, noisy, or phone-quality audio
Code-switching (mixing Arabic with other languages) may reduce accuracy
Audio shorter than 3 seconds is not supported
Only the Whisper encoder is used — the decoder (transcription) is not activated
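The inference snippet zero-pads any input to 30 seconds, so it will not reject clips below the 3-second minimum on its own. A guard along the following lines could be called before padding; this is a sketch, and `check_duration` is a hypothetical helper that is not part of the released code.

```python
def check_duration(n_samples, sample_rate=16_000, min_seconds=3.0):
    """Return the clip duration in seconds, or raise if it is too short."""
    duration = n_samples / sample_rate
    if duration < min_seconds:
        raise ValueError(
            f"Clip is {duration:.2f}s long; at least {min_seconds:.0f}s is required"
        )
    return duration

print(check_duration(48_000))   # 3.0
```

In the inference snippet this would be called as `check_duration(len(audio))` right after loading the file.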

Related Models

Model	Window	Accuracy	Repo
Whisper-large-v3 ADI17 15s	15s	92.58%	mesoj8/whisper-large-v3-arabic-dialect-15s
Whisper-large-v3 ADI17 30s	30s	95.11%	This model

Citation

@misc{mesoj8_arabic_dialect_30s_2026,
  author    = {Meso},
  title     = {Whisper-large-v3 Fine-tuned for Arabic Dialect Identification (ADI17, 30s windows)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mesoj8/whisper-large-v3-arabic-dialect-30s}
}
