Noobbbbb/bengaliai-asr-dataset-125k
A fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Bengali automatic speech recognition (ASR).

| Metric | Value |
|---|---|
| WER (500 eval samples) | 0.2449 |
| Final Training Loss | 1.0715 |
| Train Runtime | 35,879s (~10 hours) |
| Samples/Second | 34.84 |
| Steps/Second | 0.545 |
| Total FLOPs | 3.96e+20 |
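
The card does not say which tool produced the WER figure, so the following is only a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) summed over all utterances and divided by the total number of reference words. The function names `edit_distance` and `corpus_wer` are hypothetical and not part of this repository.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return errors / ref_words
```

Under this definition, a WER of 0.2449 corresponds to roughly 24.5 word errors per 100 reference words across the 500 evaluation samples.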

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "Noobbbbb/bn-wav2vec2-xlsr-53"
# Use the GPU when available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device)


def transcribe(audio_path):
    # Load audio as a (channels, samples) tensor and downmix to mono.
    speech, sr = torchaudio.load(audio_path)
    speech = speech.mean(dim=0)
    # The model expects 16 kHz input.
    if sr != 16000:
        speech = torchaudio.functional.resample(speech, sr, 16000)
    inputs = processor(
        speech.numpy(), sampling_rate=16000, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    # Greedy CTC decoding: take the most likely token at each frame.
    pred_ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(pred_ids)


print(transcribe("path/to/audio.mp3"))
```
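
The `argmax` followed by `processor.decode` in the snippet above amounts to greedy CTC decoding. As a rough illustration of what that decode step does, the sketch below merges consecutive repeated ids, drops the blank token, and turns the word delimiter into a space. The function name `ctc_greedy_collapse` and the toy vocabulary are hypothetical; the real vocabulary and blank id come from the model's tokenizer.

```python
from itertools import groupby
from typing import Dict, Sequence


def ctc_greedy_collapse(
    pred_ids: Sequence[int],
    id_to_token: Dict[int, str],
    blank_id: int,
    word_delimiter: str = "|",
) -> str:
    """Collapse a frame-level id sequence into text, CTC style."""
    # Step 1: merge runs of identical consecutive ids into one id.
    merged = [token_id for token_id, _ in groupby(pred_ids)]
    # Step 2: remove blank tokens, which only separate repeated characters.
    tokens = [id_to_token[token_id] for token_id in merged if token_id != blank_id]
    # Step 3: map the word delimiter token to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary for illustration only (not the model's real vocabulary).
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
toy_frames = [2, 2, 0, 2, 3, 3, 1, 1, 2]
decoded = ctc_greedy_collapse(toy_frames, toy_vocab, blank_id=0)
```

Note that the blank between the two runs of `a` is what keeps them as two separate characters; without it the repeats would merge into one.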