File size: 2,896 Bytes
633a9e7 54c4dba 633a9e7 b722a56 633a9e7 96f8bab 3637650 b722a56 3135966 d118fdd 633a9e7 b722a56 633a9e7 1499f63 633a9e7 691adc6 1499f63 15d267f 691adc6 a5c18d6 057825f a5c18d6 15d267f cce5842 77b03e4 cce5842 67a746a cce5842 77b03e4 cce5842 77b03e4 cce5842 77b03e4 cce5842 77b03e4 cce5842 77b03e4 cce5842 e65b698 a5c18d6 e65b698 7652ef8 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 |
---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
- example_title: example 2
src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav
---
# wav2vec2-xls-r-parlaspeech-hr
This model for Croatian ASR is based on the [facebook/wav2vec2-xls-r-300m model](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was fine-tuned with 300 hours of recordings and transcripts from the ASR Croatian parliament dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).
If you use this model, please cite the following paper:
Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.16.pdf
## Metrics
Evaluation is performed on the dev and test portions of the [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494) dataset.
|split|CER|WER|
|---|---|---|
|dev|0.0335|0.1046|
|test|0.0234|0.0761|
There are multiple models available, and in terms of CER and WER, the best-performing model is [wav2vec2-large-slavic-parlaspeech-hr-lm](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm).
## Usage in `transformers`
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# load model and tokenizer
processor = Wav2Vec2Processor.from_pretrained(
"classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr")
# download the example wav files:
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")
# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)
# remove the raw wav file
os.system("rm 00020570a.flac.wav")
# retrieve logits
logits = model.to(device)(input_values).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()
# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
```
## Training hyperparameters
In fine-tuning, the following arguments were used:
| arg | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16 |
| `gradient_accumulation_steps` | 4 |
| `num_train_epochs` | 8 |
| `learning_rate` | 3e-4 |
| `warmup_steps` | 500 | |