Wav2Vec2-Large-XLSR-Bengali

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Bengali speech.

Training Details

  • GPU: A100 80GB
  • Epochs: 10
  • Global Steps: 19,540
  • Batch Size: 64
  • Learning Rate: 3e-4
  • Warmup Steps: 1,000
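
The hyperparameters above imply a few derived quantities, worked out in the plain-Python sketch below. Two things here are assumptions rather than facts from this card: that the batch size of 64 is the effective batch per optimizer step, and that the learning rate follows a linear warmup followed by linear decay to zero (the card names the warmup length but not the scheduler).

```python
# Values taken from the training details above.
EPOCHS = 10
GLOBAL_STEPS = 19_540
BATCH_SIZE = 64        # assumed to be the effective batch per optimizer step
PEAK_LR = 3e-4
WARMUP_STEPS = 1_000

# Optimizer steps per epoch, and a rough upper bound on the training set size
# (the last batch of an epoch may be partial).
steps_per_epoch = GLOBAL_STEPS // EPOCHS
approx_train_samples = steps_per_epoch * BATCH_SIZE


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = GLOBAL_STEPS - step
    return PEAK_LR * max(0.0, remaining / (GLOBAL_STEPS - WARMUP_STEPS))


print(steps_per_epoch, approx_train_samples)  # 1954 125056
for step in (0, 500, 1_000, 10_270, 19_540):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run covers roughly 125,000 samples per epoch, and the learning rate peaks at step 1,000 before decaying for the remaining 18,540 steps.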

Results

Metric | Value
WER (500 eval samples) | 0.2449
Final Training Loss | 1.0715
Train Runtime | 35,879 s (~10 hours)
Samples/Second | 34.84
Steps/Second | 0.545
Total FLOPs | 3.96e+20
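
WER (word error rate) is the word-level edit distance between the model output and the reference transcript, divided by the number of reference words, so 0.2449 corresponds to roughly one word error per four reference words. The card does not say which tool produced the score. The block below is a minimal pure-Python sketch of the standard corpus-level definition, using toy English strings as illustrative data; it is not the author's evaluation script.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


# Toy data: one substitution and one deletion over 8 reference words.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sit on mat", "hello world"]
print(wer(refs, hyps))  # 0.25
```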

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "Noobbbbb/bn-wav2vec2-xlsr-53"
# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device)


def transcribe(audio_path):
    # Load the audio and downmix to mono by averaging the channels.
    speech, sr = torchaudio.load(audio_path)
    speech = speech.mean(dim=0)
    # The model expects 16 kHz input, so resample if needed.
    if sr != 16000:
        speech = torchaudio.functional.resample(speech, sr, 16000)
    inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    # Greedy CTC decoding: pick the most likely token for each frame.
    pred_ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(pred_ids)


print(transcribe("path/to/audio.mp3"))
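
The argmax followed by processor.decode in the usage code is greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates only that collapse step in plain Python. The vocabulary and the blank ID of 0 are hypothetical values chosen for illustration, not this model's actual vocabulary.

```python
BLANK_ID = 0  # hypothetical blank token ID for this toy example
VOCAB = {1: "c", 2: "a", 3: "t", 4: " "}  # hypothetical toy vocabulary


def ctc_greedy_collapse(frame_ids: list[int]) -> str:
    """Merge consecutive duplicate IDs, drop blanks, and map IDs to text."""
    out = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != BLANK_ID:
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)


# Per-frame argmax IDs spelling "cc_aa_t" (with _ as blank) collapse to "cat".
print(ctc_greedy_collapse([1, 1, 0, 2, 2, 0, 3]))  # cat
# A blank between two identical tokens keeps both: "t_t" becomes "tt".
print(ctc_greedy_collapse([3, 0, 3]))  # tt
```

The blank token is what lets CTC emit genuinely doubled characters: without a blank between them, two identical neighbouring frames are treated as one token.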

Model size: 0.3B params (Safetensors, tensor type F32)