---
license: mit
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: sample 1
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
- example_title: sample 2
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

BanglaASR is a Bangla automatic speech recognition model, obtained by fine-tuning Whisper on the Bangla Mozilla Common Voice dataset. Training used about 40k training samples and 7k validation samples, from around 400 hours of data. The model was trained for 12,000 steps and reaches a word error rate of 4.58%. It is based on the Whisper small variant (244 M parameters).

```py
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loading straight from a URL depends on the torchaudio backend in use;
# if it fails, download the file first and pass the local path instead.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to the 16 kHz Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids and decode them to text.
predicted_ids = model.generate(input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```

# Dataset

The model was trained on the Bangla Mozilla Common Voice dataset: around 400 hours of data, with 40k training and 7k validation mp3 samples. For more information about the dataset, see the [Common Voice Bangla page](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information

| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------------- | ------------- | -------- | -------- | ------------- | ------------- | -------- |
| tiny | 4 | 384 | 6 | 39 M | X | X |
| base | 6 | 512 | 8 | 74 M | X | X |
| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
| medium | 24 | 1024 | 16 | 769 M | X | X |
| large | 32 | 1280 | 20 | 1550 M | X | X |

# Evaluation

Word Error Rate: 4.58%

For more details, see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```
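For readers unfamiliar with the metric reported under Evaluation, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The snippet below is a minimal sketch of that definition for a single utterance. It is not the evaluation script used for this model: the function name `word_error_rate` and the sample strings are hypothetical, and the text normalization and corpus-level aggregation behind the 4.58% figure are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Hypothetical strings: one substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

In practice the reference would be the ground-truth Common Voice sentence and the hypothesis the `transcription` string produced by the model. Scores over a whole test set are usually computed by summing errors and reference words across utterances rather than averaging per-utterance rates.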