Wav2Vec2-Large-XLSR-Bengali
A fine-tuned version of Noobbbbb/bn-wav2vec2-xlsr-53 for Bengali speech recognition.
Training Details
| Config | Value |
|---|---|
| GPU | G4 95GB |
| Epochs | 3 |
| Global Steps | 1,995 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Warmup Steps | 1,000 |
| Train Runtime | 2,829 s (~47 min) |
| Samples/Second | 90.13 |
| Steps/Second | 0.705 |
| Total FLOPs | 8.19e+19 |
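
As a rough consistency check on the throughput figures above, the sketch below derives the total number of samples processed and the implied number of samples per optimizer step. The inputs are copied from the table; any interpretation of the result is an assumption, since the table does not state how the run was configured beyond these values.

```python
# Figures copied from the Training Details table.
train_runtime_s = 2829
samples_per_second = 90.13
global_steps = 1995
epochs = 3

total_samples = samples_per_second * train_runtime_s   # samples seen over all epochs
samples_per_epoch = total_samples / epochs
samples_per_step = total_samples / global_steps         # implied effective batch size
steps_per_second = global_steps / train_runtime_s

print(f"total samples:     {total_samples:,.0f}")
print(f"samples per epoch: {samples_per_epoch:,.0f}")
print(f"samples per step:  {samples_per_step:.1f}")
print(f"steps per second:  {steps_per_second:.3f}")
```

The implied value of roughly 128 samples per step is twice the listed batch size of 64. That would be consistent with, for example, two gradient-accumulation steps or two devices, but the table does not say which.
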
Validation Metrics (during training)
| Step | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|
| 500 | 1.4964 | 0.2075 | 0.3652 | 0.0944 |
| 1000 | 1.4044 | 0.1968 | 0.3598 | 0.0917 |
| 1500 | 1.4648 | 0.1946 | 0.3552 | 0.0906 |
| 1995 | 1.4648 | 0.1918 | 0.3517 | 0.0898 |
Results
| Metric | Value |
|---|---|
| Final Training Loss | 1.4461 |
| Best Validation Loss | 0.1918 |
| Best WER (validation) | 0.3517 |
| Best CER (validation) | 0.0898 |
| WER (500 eval samples) | 0.2484 |
| CER (500 eval samples) | 0.0565 |
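
For readers unfamiliar with the metrics, here is a minimal sketch of the standard definitions: WER is the word-level edit distance divided by the number of reference words, and CER is the same at character level. This is not the evaluation script behind the numbers above. The library and text normalization used for this card are not stated, and the helper names and the English example strings are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    # Assumes a non-empty reference; words are split on whitespace.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Assumes a non-empty reference; spaces count as characters here.
    return edit_distance(reference, hypothesis) / len(reference)

ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"WER={wer(ref, hyp):.4f}  CER={cer(ref, hyp):.4f}")
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances rather than averaging per-utterance ratios, so results from a sketch like this can differ slightly from a library's output.
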
Usage
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    import torch
    import torchaudio

    MODEL_ID = "Noobbbbb/bn-wav2vec2-xlsr-53-v2"
    # Use the GPU when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device).eval()

    def transcribe(audio_path):
        # Load the audio and downmix to mono.
        speech, sr = torchaudio.load(audio_path)
        speech = speech.mean(dim=0)
        # The model expects 16 kHz input.
        if sr != 16000:
            speech = torchaudio.functional.resample(speech, sr, 16000)
        inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values.to(device)).logits
        # Greedy CTC decoding: pick the most likely token per frame, then decode.
        pred_ids = torch.argmax(logits, dim=-1)[0]
        return processor.decode(pred_ids)

    print(transcribe("path/to/audio.mp3"))
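
The argmax followed by `processor.decode` in `transcribe` amounts to greedy CTC decoding. The sketch below illustrates the idea in plain Python: collapse consecutive repeated ids, drop the blank token, and turn the word delimiter into a space. The function name, the toy vocabulary, and the frame ids are hypothetical. The real tokenizer has its own Bengali character vocabulary and handles further details.

```python
def greedy_ctc_decode(pred_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the word delimiter to a space."""
    tokens = []
    previous = None
    for idx in pred_ids:
        # Emit a token only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()

# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b", 4: "c"}
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 4, 0, 0]
decoded = greedy_ctc_decode(frame_ids, vocab, blank_id=0)
print(decoded)
```

Note that the blank between the two runs of id 2 is what lets the same character appear twice in a row in the output. Without it, the repeated ids would collapse into one.
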