# Whisper Large v3 Turbo (Albanian Fine-Tuned) - v2
This is a fine-tuned version of the Whisper Large v3 Turbo model, optimized for Albanian speech-to-text transcription. It achieves a Word Error Rate (WER) of 6.98% on a held-out evaluation set.
## Model Details

- Base Model: `openai/whisper-large-v3-turbo`
- Language: Albanian (`sq`)
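To transcribe audio with this checkpoint, a minimal sketch using the `transformers` ASR pipeline. The repo id below is a placeholder, not the actual Hub id of this fine-tune; substitute the real one:

```python
from transformers import pipeline

# NOTE: placeholder repo id -- replace with the actual Hub id of this fine-tune.
asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-large-v3-turbo-sq",
    generate_kwargs={"language": "sq", "task": "transcribe"},
)

# Transcribe a local audio file; ffmpeg handles decoding/resampling.
result = asr("sample.wav")
print(result["text"])
```

Pinning `language` and `task` in `generate_kwargs` avoids Whisper's automatic language detection, which can misfire on short clips.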
## Training Dataset

- Source: Mozilla Common Voice version 19, available on the Hugging Face Hub as `Kushtrim/common_voice_19_sq`
- Description: Audio clips of spoken Albanian, ranging from 5 to 30 seconds each.
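The dataset named above can be pulled straight from the Hub with the `datasets` library. A sketch; the split name and the `sentence` transcript column are assumptions based on the usual Common Voice layout, so check the dataset card:

```python
from datasets import Audio, load_dataset

# Load the Albanian Common Voice 19 mirror named above.
cv_sq = load_dataset("Kushtrim/common_voice_19_sq", split="train")  # split name is an assumption

# Whisper feature extractors expect 16 kHz input, so resample on the fly.
cv_sq = cv_sq.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_sq[0]
print(sample["audio"]["array"].shape, sample["sentence"])  # "sentence" column is an assumption
```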
## Training Details

The model was fine-tuned on an NVIDIA A100 GPU (40GB) using the `transformers` library. Below are the key training arguments:

| Argument | Value | Description |
|---|---|---|
| `per_device_train_batch_size` | 8 | Training batch size per GPU |
| `per_device_eval_batch_size` | 2 | Evaluation batch size per GPU |
| `gradient_accumulation_steps` | 1 | Steps to accumulate gradients (effective batch size = 8) |
| `num_train_epochs` | 3 | Number of training epochs |
| `learning_rate` | 1e-5 | Initial learning rate |
| `warmup_steps` | 300 | Number of learning-rate warmup steps |
| `evaluation_strategy` | `"steps"` | Evaluate every `eval_steps` during training |
| `eval_steps` | 250 | Run evaluation every 250 steps |
| `fp16` | `True` | Use mixed-precision (16-bit float) training |
- Total Steps: ~3,540 scheduled (completed 3,500)
- Hardware: NVIDIA A100 (40GB)
- Libraries:
  - `transformers==4.38.2`
  - `torch==2.2.1`
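The step count follows from the batch settings: with an effective batch size of 8 (8 per device × 1 accumulation step, single GPU), ~3,540 steps over 3 epochs means about 1,180 optimizer steps per epoch, i.e. a training split of roughly 9,400 clips. A quick sanity check of that arithmetic; the training-split size is inferred, not stated in the card:

```python
import math

per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_gpus = 1
num_train_epochs = 3

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Inferred, not stated in the card: a training-split size consistent with ~3,540 total steps.
approx_train_examples = 9_440

steps_per_epoch = math.ceil(approx_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch_size, steps_per_epoch, total_steps)  # 8 1180 3540
```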
## Performance

| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 250 | 0.4744 | 0.3991 | 34.03% |
| 500 | 0.3421 | 0.3426 | 30.42% |
| 750 | 0.2871 | 0.2808 | 26.09% |
| 1000 | 0.2401 | 0.2258 | 21.31% |
| 1250 | 0.1809 | 0.1998 | 19.15% |
| 1500 | 0.1142 | 0.1827 | 17.33% |
| 1750 | 0.1051 | 0.1611 | 15.19% |
| 2000 | 0.0930 | 0.1464 | 13.82% |
| 2250 | 0.0827 | 0.1313 | 11.79% |
| 2500 | 0.0420 | 0.1139 | 10.50% |
| 2750 | 0.0330 | 0.1124 | 9.58% |
| 3000 | 0.0255 | 0.1006 | 8.38% |
| 3250 | 0.0256 | 0.0905 | 7.48% |
| 3500 | 0.0204 | 0.0889 | 6.98% |
- Final WER: 6.98% (at step 3500)
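WER is the word-level edit distance between hypothesis and reference, normalized by the reference word count. A minimal self-contained implementation for illustration; for real evaluation the `evaluate` or `jiwer` packages are the usual choice:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference -> 25% WER.
print(wer("si je ti sot", "si je ai sot"))  # 0.25
```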