# ASR_yoruba_Azure
This model achieves the following results on the evaluation set:
- Loss: 0.2229
- Wer: 0.2481
## Model description
This model is based on the wav2vec2 architecture, has 965 million parameters, and was trained on 36 hours of Yoruba audio data from multiple speakers. It currently achieves a Word Error Rate (WER) of approximately 24.8%.
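As a quick sanity check, the parameter count can be read off the checkpoint itself. A minimal sketch, assuming the model is published on the Hugging Face Hub (the repository id below is a placeholder, not the actual one):

```python
# Count the parameters of a wav2vec2 CTC checkpoint.
# "your-username/ASR_yoruba_Azure" is a placeholder repository id.
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("your-username/ASR_yoruba_Azure")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected to print roughly 965M
```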
## Intended uses & limitations
- The model is designed for Automatic Speech Recognition (ASR) in the Yoruba language.
- It can be used to transcribe spoken Yoruba into text, supporting applications such as voice-activated systems, automated transcription services, and linguistic research (see the inference sketch below).
- The current WER of approximately 24.8% indicates room for improvement in transcription accuracy. Performance may be affected by background noise, accents, and variation in speaker pronunciation.
- It is optimized for short audio clips (up to five minutes) due to GPU memory constraints.
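A minimal inference sketch using the transformers automatic-speech-recognition pipeline; the repository id and audio path are placeholders:

```python
# Transcribe a short Yoruba clip with the transformers ASR pipeline.
# "your-username/ASR_yoruba_Azure" and "sample_yoruba.wav" are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="your-username/ASR_yoruba_Azure")

# chunk_length_s splits long audio into windows so clips near the
# five-minute limit stay within GPU memory.
result = asr("sample_yoruba.wav", chunk_length_s=30)
print(result["text"])
```

Note that passing a file path requires ffmpeg to be available for audio decoding; a raw numpy array of samples can be passed instead.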
## Training and evaluation data
- The model was trained on 36 hours of Yoruba audio spanning multiple speakers to capture diverse accents and speech patterns. The data includes conversational speech, read speech, and a range of audio qualities to improve robustness.
- The evaluation dataset is a representative sample of Yoruba speech held out from training. The WER of approximately 24.8% reflects the model's accuracy on this evaluation data across a variety of speech scenarios (a sketch of the WER computation follows).
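For reference, WER counts word-level substitutions, insertions, and deletions against a reference transcript. A minimal sketch of how the reported metric can be reproduced with the evaluate library; the transcripts below are illustrative, not drawn from the evaluation set:

```python
# Compute Word Error Rate between reference and predicted transcripts.
# The example strings are illustrative placeholders.
import evaluate

wer_metric = evaluate.load("wer")
references = ["bawo ni o se wa"]
predictions = ["bawo ni o se wa"]
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}")  # 0.0 for an exact match
```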
## Training procedure
The model follows the wav2vec2 approach: pre-training on large-scale unlabeled audio followed by fine-tuning on the labeled Yoruba data described above. Fine-tuning optimized the model parameters to minimize transcription errors, with data augmentation and regularization applied to improve generalization. Training ran on high-performance GPUs to accommodate the data volume and parameter count, with periodic evaluations to monitor progress and adjust the training strategy.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 64
- mixed_precision_training: Native AMP
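These values map directly onto a transformers TrainingArguments configuration; a minimal sketch under that assumption (output_dir is a placeholder, and the Adam betas/epsilon listed above are the library defaults):

```python
# Map the listed hyperparameters onto a transformers TrainingArguments object.
# output_dir is a placeholder; Adam betas=(0.9, 0.999), eps=1e-8 are defaults.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ASR_yoruba_Azure",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,  # effective train batch size: 8 * 4 = 32
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=64,
    fp16=True,                      # native AMP mixed precision
)
```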
### Training results
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
0.2635 | 1.5238 | 400 | 0.2428 | 0.2582 |
0.2494 | 3.0476 | 800 | 0.2321 | 0.2571 |
0.2438 | 4.5714 | 1200 | 0.2315 | 0.2517 |
0.2417 | 6.0952 | 1600 | 0.2282 | 0.2591 |
0.2349 | 7.6190 | 2000 | 0.2299 | 0.2529 |
0.237 | 9.1429 | 2400 | 0.2301 | 0.2545 |
0.2355 | 10.6667 | 2800 | 0.2262 | 0.2559 |
0.2321 | 12.1905 | 3200 | 0.2290 | 0.2527 |
0.235 | 13.7143 | 3600 | 0.2265 | 0.2546 |
0.2289 | 15.2381 | 4000 | 0.2260 | 0.2551 |
0.2305 | 16.7619 | 4400 | 0.2267 | 0.2519 |
0.2314 | 18.2857 | 4800 | 0.2308 | 0.2583 |
0.2283 | 19.8095 | 5200 | 0.2243 | 0.2486 |
0.2288 | 21.3333 | 5600 | 0.2288 | 0.2563 |
0.2303 | 22.8571 | 6000 | 0.2244 | 0.2466 |
0.2275 | 24.3810 | 6400 | 0.2266 | 0.2471 |
0.2261 | 25.9048 | 6800 | 0.2264 | 0.2509 |
0.2271 | 27.4286 | 7200 | 0.2244 | 0.2494 |
0.2321 | 28.9524 | 7600 | 0.2257 | 0.2477 |
0.2261 | 30.4762 | 8000 | 0.2243 | 0.2533 |
0.2247 | 32.0 | 8400 | 0.2255 | 0.2449 |
0.2229 | 33.5238 | 8800 | 0.2268 | 0.2471 |
0.2242 | 35.0476 | 9200 | 0.2233 | 0.2459 |
0.2299 | 36.5714 | 9600 | 0.2268 | 0.2527 |
0.2272 | 38.0952 | 10000 | 0.2248 | 0.2471 |
0.2242 | 39.6190 | 10400 | 0.2249 | 0.2462 |
0.2249 | 41.1429 | 10800 | 0.2245 | 0.2469 |
0.2244 | 42.6667 | 11200 | 0.2249 | 0.2534 |
0.2264 | 44.1905 | 11600 | 0.2247 | 0.2457 |
0.2252 | 45.7143 | 12000 | 0.2237 | 0.2464 |
0.2239 | 47.2381 | 12400 | 0.2240 | 0.2495 |
0.2268 | 48.7619 | 12800 | 0.2240 | 0.2494 |
0.2264 | 50.2857 | 13200 | 0.2243 | 0.2528 |
0.2244 | 51.8095 | 13600 | 0.2238 | 0.2495 |
0.2236 | 53.3333 | 14000 | 0.2226 | 0.2475 |
0.2266 | 54.8571 | 14400 | 0.2230 | 0.2470 |
0.225 | 56.3810 | 14800 | 0.2232 | 0.2453 |
0.2233 | 57.9048 | 15200 | 0.2227 | 0.2467 |
0.223 | 59.4286 | 15600 | 0.2226 | 0.2496 |
0.224 | 60.9524 | 16000 | 0.2226 | 0.2472 |
0.2225 | 62.4762 | 16400 | 0.2229 | 0.2481 |
### Framework versions
- Transformers 4.44.0.dev0
- Pytorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1