metadata
language: sr
datasets:
- juznevesti-sr
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: Croatian example 1
src: >-
https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
- example_title: Croatian example 2
src: >-
https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav
- example_title: Croatian example 3
src: >-
https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav
wav2vec2-large-juznevesti
This model for Serbian ASR is based on the facebook/wav2vec2-xls-r-300m model and was fine-tuned on 58 hours of audio and transcripts from the Južne vesti programme '15 minuta'.
For more information on how the dataset was created, see this repo.
Metrics
Evaluation is performed on the dev and test portions of the JuzneVesti dataset:
metric | dev | test |
---|---|---|
WER | 0.295206 | 0.290094 |
CER | 0.140766 | 0.137642 |
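For readers unfamiliar with the two metrics, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly defined: the edit distance between reference and hypothesis, divided by the reference length. The helper names and the example strings are hypothetical, and the card does not state which tool or aggregation the authors used, so this is an illustration rather than their evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example pair: one word out of four differs by one character.
reference = "velik broj poslovnih subjekata"
hypothesis = "velik broj poslovni subjekata"
print(wer(reference, hypothesis))  # 0.25
print(cer(reference, hypothesis))
```

Because a single wrong character makes the whole word count as an error, WER is typically noticeably higher than CER, which matches the pattern in the table above.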
Usage in transformers
Tested with transformers==4.18.0, torch==1.11.0, and SoundFile==0.10.3.post1.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# load the processor (feature extractor + tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained(
"5roop/wav2vec2-xls-r-juznevesti-sr")
model = Wav2Vec2ForCTC.from_pretrained("5roop/wav2vec2-xls-r-juznevesti-sr")
# download the example wav files:
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")
# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)
# remove the raw wav file
os.system("rm 00020570a.flac.wav")
# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model.to(device)(input_values).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
transcription # 'velik broj poslovnih subjekata posluje sa minosom velik deo'
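The "take argmax and decode" step is greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the `greedy_ctc_decode` helper, and the frame IDs are hypothetical; it assumes the padding token acts as the CTC blank and that `|` marks word boundaries, and it is not the library's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
VOCAB = {0: "<pad>", 1: "|", 2: "d", 3: "a", 4: "n"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumption: word boundaries are marked with "|"


def greedy_ctc_decode(ids):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# One predicted ID per audio frame, as the per-frame argmax would produce.
frame_ids = [2, 2, 0, 3, 3, 0, 0, 1, 2, 3, 3, 4, 0]
print(greedy_ctc_decode(frame_ids))  # da dan
```

The blank between two identical IDs is what allows genuine double letters to survive the repeat-merging step.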
Training hyperparameters
In fine-tuning, the following arguments were used:
arg | value |
---|---|
per_device_train_batch_size | 16 |
gradient_accumulation_steps | 4 |
num_train_epochs | 20 |
learning_rate | 3e-4 |
warmup_steps | 500 |
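As a rough illustration of how these values fit together, the sketch below collects them in a dictionary and derives the effective batch size per device. The argument names match parameters of `transformers.TrainingArguments`, so the dictionary could be unpacked into it together with an output directory of your choosing; the rest of the training setup (number of devices, scheduler, data collator) is not given in this card and is not assumed here.

```python
# Hyperparameters exactly as listed in the table above.
hyperparams = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 20,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}

# Gradients are accumulated over several forward/backward passes before each
# optimizer step, so one update sees batch_size * accumulation_steps samples
# per device.
effective_batch_per_device = (
    hyperparams["per_device_train_batch_size"]
    * hyperparams["gradient_accumulation_steps"]
)
print(effective_batch_per_device)  # 64

# With transformers installed, the same dictionary could be passed on, e.g.:
# TrainingArguments(output_dir="<your output dir>", **hyperparams)
```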