metadata
language: hr
datasets:
  - parlaspeech-hr
tags:
  - audio
  - automatic-speech-recognition
  - parlaspeech
widget:
  - example_title: example 1
    src: >-
      https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
  - example_title: example 2
    src: >-
      https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav

wav2vec2-xls-r-parlaspeech-hr

This Croatian ASR model is based on the facebook/wav2vec2-xls-r-300m model and was fine-tuned on 300 hours of recordings and transcripts from the Croatian parliament ASR dataset ParlaSpeech-HR v1.0.

The work behind this model was coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec; the method for fine automatic data alignment from Plüss et al. was applied by Vuk Batanović and Lenka Bajčetić; the transcripts were normalised by Danijel Koržinek; and the final modelling was performed by Peter Rupnik.

If you use this model, please cite the following paper:

Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.

Metrics

| split | CER    | WER    |
|-------|--------|--------|
| dev   | 0.0335 | 0.1046 |
| test  | 0.0234 | 0.0761 |
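
For readers unfamiliar with the two metrics, the following is a minimal sketch of how CER and WER are commonly defined: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in characters or words respectively. The model card does not state which tool produced the figures above, so the function names and the toy sentence pair here are illustrative assumptions, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical toy pair, not taken from the dataset.
reference = "dobar dan svima"
hypothesis = "dobar dan svim"
print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

A single wrong word ending costs a full word in WER but only one character in CER, which is consistent with the CER values in the table being lower than the WER values.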

Usage in transformers

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the processor and the model, then move the model to the target device
processor = Wav2Vec2Processor.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr").to(device)


# download the example wav file
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file 
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# remove the raw wav file
os.system("rm 00020570a.flac.wav")

# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model.to(device)(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()

# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
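
The argmax-and-decode step above is greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse on plain Python lists. The toy vocabulary, the frame ids, and the choice of id 0 as the blank are assumptions made for illustration; the real mapping comes from the model's own tokenizer, and `processor.decode` handles it for you.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove blank ids."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only if it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "d", 2: "a", 3: "n"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would give for one utterance.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3]

text = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # dan
```

The blank is what lets a genuinely doubled letter survive: the frames `[2, 0, 2]` collapse to two separate tokens, whereas `[2, 2]` collapses to one.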

Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                         | value |
|-----------------------------|-------|
| per_device_train_batch_size | 16    |
| gradient_accumulation_steps | 4     |
| num_train_epochs            | 8     |
| learning_rate               | 3e-4  |
| warmup_steps                | 500   |
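
The batch-size arguments above interact: with gradient accumulation, one optimizer step covers several forward passes. The sketch below collects the listed values in a plain dictionary and computes the resulting effective batch size. The argument names match keyword arguments of `transformers.TrainingArguments`, but the model card does not say which training script was used or how many devices were involved, so `num_devices` and the helper function are assumptions.

```python
# Values copied from the hyperparameter list above.
finetune_args = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(args, num_devices=1):
    """Number of examples contributing to each optimizer step.

    num_devices is an assumption; the source does not state the hardware setup.
    """
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(finetune_args))  # 64 on a single device
```

On a single device this gives 16 × 4 = 64 utterances per optimizer step; the figure scales linearly with the number of devices.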