metadata

license: apache-2.0
datasets:
  - mozilla-foundation/common_voice_17_0
language:
  - lv
pipeline_tag: automatic-speech-recognition

General-purpose Latvian ASR model

This is a whisper-large-v3 model fine-tuned for Latvian by AiLab.lv on two general-purpose speech datasets: the Latvian part of Common Voice 17.0 and the Latvian broadcast dataset LATE-Media.

We also provide a 5-bit quantized version of the model in the GGML format.
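
A minimal sketch of how a fine-tuned Whisper checkpoint like this one could be used for transcription through the Transformers `pipeline` API. The repository id in `MODEL_ID` is a hypothetical placeholder (this card does not state the actual id), and the 30-second chunk length is an assumption chosen to match Whisper's input window, not a setting taken from this card.

```python
# Hypothetical placeholder: replace with the actual repository id of this model.
MODEL_ID = "your-namespace/whisper-large-v3-latvian"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe a Latvian audio file and return the recognized text."""
    # Imported lazily so the module can be loaded without Transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long recordings into 30 s chunks
    )
    # Force Latvian transcription instead of relying on language detection.
    result = asr(
        audio_path,
        generate_kwargs={"language": "latvian", "task": "transcribe"},
    )
    return result["text"]


# Example call (requires the transformers package and a local audio file):
# print(transcribe("sample.wav"))
```

Decoding audio files from a path typically also requires ffmpeg to be available on the system.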

Training

Fine-tuning was done using the Hugging Face Transformers library with a modified seq2seq script.

Training data:

Dataset	Hours
Latvian CV 17.0 train set (the V1 split)	167
LATE-Media train set	42
Total	209

Evaluation

The model was evaluated on the Latvian CV 17.0 test set (the V1 split) and on the LATE-Media test set.

Dataset	WER (%)	CER (%)
Latvian CV 17.0 V1 - formatted	5.0	1.6
Latvian CV 17.0 V1 - normalized	3.4	1.0
LATE-Media 1.0 - formatted	20.8	8.2
LATE-Media 1.0 - normalized	14.1	5.9
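
For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly computed from an edit distance, and why scores on formatted text tend to be higher than on normalized text. The `normalize` function is an assumption (lowercasing and punctuation removal); this card does not define the exact normalization used for the table above, and the example sentence is illustrative only.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "Labdien, kā jums klājas?"
hypothesis = "labdien kā jums klājas"

formatted_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(f"formatted WER: {formatted_wer:.1f}, normalized WER: {normalized_wer:.1f}")
```

In this toy example the hypothesis differs from the reference only in casing and punctuation, so the formatted score counts two word errors out of four while the normalized score counts none.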

Citation

Please cite this paper if you use this model in your research:

@inproceedings{dargis-etal-2024-balsutalka-lv,
  author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
  title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
  booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  publisher = {ELRA and ICCL},
  year = {2024},
  pages = {2080--2085},
  url = {https://aclanthology.org/2024.lrec-main.187}
}

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project LATE (VPP-LETONIKA-2021/1-0006). We are grateful to all the participants of the national initiative BalsuTalka.lv for helping to make the Latvian Common Voice dataset much larger and more diverse.