Wav2Vec2-Large-XLSR-Bengali

Noobbbbb/bn-wav2vec2-xlsr-53 fine-tuned on Bengali speech for automatic speech recognition.

Training Details

| Config | Value |
|---|---|
| GPU | G4 95GB |
| Epochs | 3 |
| Global Steps | 1,995 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Warmup Steps | 1,000 |
| Train Runtime | 2,829 s (~47 min) |
| Samples/Second | 90.13 |
| Steps/Second | 0.705 |
| Total FLOPs | 8.19e+19 |
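
The throughput figures above can be cross-checked against each other with a little arithmetic. The sketch below uses only values from the table; the interpretation of the last number is an assumption, since the card does not state gradient accumulation or device count.

```python
# Values copied from the Training Details table.
global_steps = 1995
runtime_s = 2829
samples_per_second = 90.13
listed_batch_size = 64

# Steps/Second should equal global steps divided by runtime.
steps_per_second = global_steps / runtime_s

# Samples processed over the whole run, implied by the reported throughput.
samples_processed = samples_per_second * runtime_s

# Samples consumed per optimizer step, implied by the two numbers above.
samples_per_step = samples_processed / global_steps

print(f"steps/s: {steps_per_second:.3f}")
print(f"samples per optimizer step: {samples_per_step:.1f}")
print(f"ratio to listed batch size: {samples_per_step / listed_batch_size:.2f}")
```

The implied samples per step comes out close to twice the listed batch size of 64. That would be consistent with, for example, gradient accumulation over two batches or two devices, but the card does not say which, so treat this only as a consistency check.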

Validation Metrics (during training)

| Step | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|
| 500 | 1.4964 | 0.2075 | 0.3652 | 0.0944 |
| 1000 | 1.4044 | 0.1968 | 0.3598 | 0.0917 |
| 1500 | 1.4648 | 0.1946 | 0.3552 | 0.0906 |
| 1995 | 1.4648 | 0.1918 | 0.3517 | 0.0898 |

Results

| Metric | Value |
|---|---|
| Final Training Loss | 1.4461 |
| Best Validation Loss | 0.1918 |
| Best WER (validation) | 0.3517 |
| Best CER (validation) | 0.0898 |
| WER (500 eval samples) | 0.2484 |
| CER (500 eval samples) | 0.0565 |
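
For readers unfamiliar with the metrics, WER and CER are edit distances normalized by reference length, computed over words and characters respectively. The card does not say which tool produced the numbers above, so the following is only a minimal pure-Python sketch of the definitions. The function names are illustrative, and treating each Unicode code point as one character is an assumption that may differ from the original evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance (reference must be non-empty)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance, counting Unicode code points."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words
print(cer("abcd", "abxd"))      # one substitution over 4 characters
```

Corpus-level scores are usually obtained by summing edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.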

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "Noobbbbb/bn-wav2vec2-xlsr-53-v2"
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device)

def transcribe(audio_path):
    # Load the audio and downmix to mono: (channels, samples) -> (samples,)
    speech, sr = torchaudio.load(audio_path)
    speech = speech.mean(dim=0)
    # Resample to 16 kHz, the rate passed to the processor below.
    if sr != 16000:
        speech = torchaudio.functional.resample(speech, sr, 16000)
    inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    # Greedy CTC decoding: most likely token per frame, collapsed by the processor.
    pred_ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(pred_ids)

print(transcribe("path/to/audio.mp3"))
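
The last two steps of `transcribe` perform greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse step in isolation. The vocabulary is a toy one invented for the example (it is not the model's tokenizer), and using `|` as the word delimiter and id 0 as the blank are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated ids, then remove the CTC blank."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one utterance.
frames = [2, 2, 0, 2, 3, 3, 0, 1, 1, 3]

ids = ctc_greedy_collapse(frames, blank_id=0)
text = "".join(vocab[i] for i in ids).replace("|", " ")
print(text)  # prints "aab b"
```

Note that the blank between the first two `a` frames is what lets the output keep a doubled letter; without it the repeats would merge into one.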