metadata

license: apache-2.0
datasets:
  - mozilla-foundation/common_voice_17_0
language:
  - lv
pipeline_tag: automatic-speech-recognition
base_model:
  - openai/whisper-large-v3
new_version: AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19

General-purpose Latvian ASR model

This is a whisper-large-v3 model fine-tuned for Latvian by AiLab.lv on two general-purpose speech datasets: the Latvian part of Common Voice 17.0 and the Latvian broadcast dataset LATE-Media.
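
As a rough illustration of how a fine-tuned Whisper checkpoint like this one is typically loaded, the sketch below uses the Transformers `pipeline` API. It is an assumption-laden example, not the authors' own inference code: the repository ID of this model is not stated in the card, so `model_id` is left as a parameter, and the only concrete ID shown is the newer version named in the metadata. The helper name `transcribe` and the 30-second chunk length are illustrative choices.

```python
# Minimal sketch, not the authors' inference code.
# NEWER_MODEL_ID is the "new_version" entry from this card's metadata;
# the ID of this exact model is not given here, so pass it explicitly.
NEWER_MODEL_ID = "AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19"

# Force Latvian transcription instead of relying on language detection.
GENERATE_KWARGS = {"language": "latvian", "task": "transcribe"}


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a Whisper checkpoint from the Hub."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long recordings into 30 s chunks
    )
    result = asr(audio_path, generate_kwargs=GENERATE_KWARGS)
    return result["text"]
```

A call such as `transcribe("sample.wav", NEWER_MODEL_ID)` would download the checkpoint on first use; `sample.wav` is a hypothetical file name, and decoding audio from a path requires ffmpeg to be installed.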

We also provide 4-bit, 5-bit and 8-bit quantized versions of the model in the GGML format for use with whisper.cpp.
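
For the GGML files, a whisper.cpp invocation could be assembled as in the sketch below. The file names of the quantized models are not listed in this card, so the model path is hypothetical, as is the helper name `whisper_cpp_command`. The binary name is also an assumption: recent whisper.cpp builds ship `whisper-cli`, while older ones use `main`. The `-m`, `-f` and `-l` options select the model file, the input audio and the language.

```python
import subprocess


def whisper_cpp_command(model_path, audio_path, binary="whisper-cli", language="lv"):
    """Build the argument list for a whisper.cpp command-line run."""
    # -m: GGML model file, -f: input audio file, -l: spoken language code
    return [binary, "-m", model_path, "-f", audio_path, "-l", language]


def run_whisper_cpp(model_path, audio_path, **kwargs):
    """Run whisper.cpp and return its standard output (the transcript)."""
    cmd = whisper_cpp_command(model_path, audio_path, **kwargs)
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout
```

For example, `run_whisper_cpp("ggml-model-q5_0.bin", "sample.wav")` would transcribe a recording with a 5-bit model, where both file names are placeholders. whisper.cpp has traditionally expected 16 kHz WAV input, so other formats may need converting first.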

NB! This model has been superseded by a newer version: AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19

Training

Fine-tuning was done using the Hugging Face Transformers library with a modified seq2seq script.

Training data	Hours
Latvian Common Voice 17.0 train set (the V1 split)	167
LATE-Media 1.0 train set	42
Total	209

Evaluation

Testing data	WER (%)	CER (%)
Latvian Common Voice 17.0 test set (V1) - formatted	5.0	1.6
Latvian Common Voice 17.0 test set (V1) - normalized	3.4	1.0
LATE-Media 1.0 test set - formatted	20.8	8.2
LATE-Media 1.0 test set - normalized	14.1	5.9

The Latvian CV 17.0 test set is available here. The LATE-Media 1.0 test set is available here.
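
To make the gap between the "formatted" and "normalized" rows concrete, the sketch below computes WER and CER as edit distance divided by reference length, with and without text normalization. The card does not define its normalization, so the rule used here (lowercasing and removing ASCII punctuation) is an assumption, and the function names and the sample sentence are illustrative only. This is not the evaluation script behind the reported numbers.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase, drop ASCII punctuation, squeeze spaces."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def wer(ref, hyp):
    """Word error rate in percent."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate in percent."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "Labdien, kā jums klājas?"
hypothesis = "labdien kā jums klājas"

formatted_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
formatted_cer = cer(reference, hypothesis)
```

In this toy case the hypothesis differs from the reference only in casing and punctuation, so the formatted scores penalize it while the normalized WER drops to zero. The same effect, at a smaller scale, explains why the normalized rows in the table are lower than the formatted ones.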

Citation

Please cite this paper if you use this model in your research:

@inproceedings{dargis-etal-2024-balsutalka-lv,
  author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
  title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
  booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  publisher = {ELRA and ICCL},
  year = {2024},
  pages = {2080--2085},
  url = {https://aclanthology.org/2024.lrec-main.187}
}

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project LATE (VPP-LETONIKA-2021/1-0006). We are grateful to all the participants of the national initiative BalsuTalka.lv for helping to make the Latvian Common Voice dataset much larger and more diverse.