---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav

---

# wav2vec2-xls-r-parlaspeech-hr

This Croatian ASR model is based on the [facebook/wav2vec2-xls-r-300m model](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was fine-tuned on 300 hours of recordings and transcripts from [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494), an ASR dataset of Croatian parliamentary speech.

If you use this model, please cite the following paper:

Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.16.pdf

## Metrics

Evaluation is performed on the dev and test portions of the [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494) dataset.

|split|CER|WER|
|---|---|---|
|dev|0.0335|0.1046|
|test|0.0234|0.0761|

Several ParlaSpeech-HR models are available; the best-performing one in terms of CER and WER is [wav2vec2-large-slavic-parlaspeech-hr-lm](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm).
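
For readers unfamiliar with the two metrics, the following is a minimal sketch of how CER and WER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length in characters or words. The function names are illustrative, and this card does not state which tool or text normalization produced the figures above, so treat this as a definition, not a reproduction of the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length.

    Assumption: spaces count as characters; other conventions exist.
    """
    return edit_distance(reference, hypothesis) / len(reference)


reference = "veliki broj poslovnih subjekata"
hypothesis = "veliki broj poslovni subjekata"  # one character missing
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

In this toy pair one of four words is wrong, so the WER is 0.25, while the CER is much lower because only one of 31 characters differs. This is why the CER column in the table is consistently smaller than the WER column.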

## Usage in `transformers`

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the processor (feature extractor + tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr").to(device)

# download the example wav file (requires wget to be installed)
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# remove the downloaded wav file
os.remove("00020570a.flac.wav")

# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()

# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
```
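
The "take argmax and decode" step above is greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse rule in plain Python. The vocabulary, the token ids and the function name `ctc_greedy_collapse` are made up for illustration and are not the model's real vocabulary; in practice `processor.decode` does this for you.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove blanks (greedy CTC rule)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # keep an id only when it starts a new run and is not the blank
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# hypothetical toy vocabulary; id 0 plays the role of the CTC blank
toy_vocab = {0: "<pad>", 1: "d", 2: "a", 3: "n"}

# per-frame argmax ids, as torch.argmax(logits, dim=-1)[0].tolist() would give
frame_ids = [1, 1, 0, 2, 2, 0, 0, 3, 3]
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)
```

Note that a blank between two identical ids keeps both of them, which is how CTC represents doubled letters.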



## Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16    |
| `gradient_accumulation_steps` | 4     |
| `num_train_epochs`            | 8     |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
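
As a rough guide to what these values imply, the sketch below computes the effective batch size per optimizer step from the batch size and gradient accumulation settings in the table. The number of devices used for fine-tuning is not stated in this card, so `num_devices` is a hypothetical parameter defaulting to 1, and `effective_batch_size` is an illustrative helper, not part of `transformers`.

```python
# values copied from the table above
hyperparams = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(params, num_devices=1):
    """Samples contributing to one optimizer step.

    num_devices is an assumption; the card does not report the hardware used.
    """
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(hyperparams))  # single-device assumption
```

Under the single-device assumption each optimizer step sees 64 utterances; with more devices the figure scales linearly.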