Commit fb730d7 by saiful9379 (1 parent: b3c405c)
update readme

README.md CHANGED
@@ -11,4 +11,81 @@ widget:
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

BanglaASR is a Whisper model fine-tuned on the Bangla portion of the Mozilla Common Voice dataset. Training used about 40k training samples and 7k validation samples, roughly 400 hours of audio. After 12,000 training steps, the model reaches a word error rate (WER) of 4.58%.

```py
import librosa
import numpy as np
import torch
import torchaudio

from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Run on the GPU when one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Sample clip hosted in the model repository. Loading directly from a URL
# depends on the torchaudio backend; if it fails, download the file and
# pass a local path instead.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to 16 kHz.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# batch = processor.feature_extractor.pad(input_features, return_tensors="pt")
# Generate token IDs and decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
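
Whisper's feature extractor pads or truncates each input to a 30-second window, so the snippet above transcribes at most the first 30 seconds of a clip. For longer recordings, one option is to split the resampled waveform into windows and transcribe each in turn. The following is a minimal sketch of the splitting step only, assuming a 16 kHz mono waveform; `split_into_windows` is a hypothetical helper, not part of the BanglaASR repository or of `transformers`.

```python
from typing import List, Sequence

SAMPLE_RATE = 16000   # rate the snippet above resamples to
WINDOW_SECONDS = 30   # length of Whisper's input window


def split_into_windows(samples: Sequence[float],
                       sample_rate: int = SAMPLE_RATE,
                       window_seconds: int = WINDOW_SECONDS) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive, non-overlapping windows.

    The last window may be shorter than the others; the feature
    extractor pads it to the full window length.
    """
    if sample_rate <= 0 or window_seconds <= 0:
        raise ValueError("sample_rate and window_seconds must be positive")
    window = sample_rate * window_seconds
    return [samples[start:start + window]
            for start in range(0, len(samples), window)]


# Hypothetical usage with the objects defined in the snippet above:
# texts = []
# for chunk in split_into_windows(speech_array):
#     features = feature_extractor(chunk, sampling_rate=16000, return_tensors="pt").input_features
#     ids = model.generate(inputs=features.to(device))[0]
#     texts.append(processor.decode(ids, skip_special_tokens=True))
# print(" ".join(texts))
```

Hard cuts like these can split a word at a window boundary, so treat the joined output as approximate near the seams.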


# Dataset

The model was trained on the Mozilla Common Voice dataset, using about 400 hours of audio: 40k MP3 samples for training and 7k for validation.
For more information about the dataset, see the [Common Voice Bangla page](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information

| Size   | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------ | ------ | ----- | ----- | ---------- | ----------- | --------------- |
| tiny   | 4      | 384   | 6     | 39 M       | X           | X               |
| base   | 6      | 512   | 8     | 74 M       | X           | X               |
| small  | 12     | 768   | 12    | 244 M      | ✓           | ✓               |
| medium | 24     | 1024  | 16    | 769 M      | X           | X               |
| large  | 32     | 1280  | 20    | 1550 M     | X           | X               |

# Evaluation

Word error rate (WER): 4.58%
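
Word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition in plain Python. It is not the evaluation script used to obtain the figure above, and the whitespace tokenization and the `word_error_rate` name are assumptions made for illustration.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("আমি ভাত খাই না", "আমি ভাত খাব না"))  # 0.25
```

Published WER figures usually also depend on text normalization (punctuation, numerals, spelling variants), which this sketch leaves out.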

For more details, see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```