
General-purpose Latvian ASR model

This is a fine-tuned whisper-large-v3 model for Latvian, trained by AiLab.lv using two general-purpose speech datasets: the Latvian part of Common Voice 17.0, and a Latvian broadcast dataset LATE-Media.

We also provide 4-bit, 5-bit and 8-bit quantized versions of the model in the GGML format for use with whisper.cpp.
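The snippet below is a minimal usage sketch, not an official example: it loads the full-precision checkpoint through the standard Transformers ASR pipeline. The audio file name is a placeholder; any 16 kHz mono recording should work.

```python
# Minimal sketch: transcribe a Latvian audio file with the full-precision model
# via the Hugging Face "automatic-speech-recognition" pipeline.
# "sample_lv.wav" is a placeholder file name.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="AiLab-IMCS-UL/whisper-large-v3-lv-late-cv17",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu"
)

result = asr(
    "sample_lv.wav",
    generate_kwargs={"language": "latvian", "task": "transcribe"},
)
print(result["text"])
```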

Training

Fine-tuning was done using the Hugging Face Transformers library with a modified seq2seq training script; a minimal setup sketch is shown after the table below.

| Training data | Hours |
|---|---|
| Latvian Common Voice 17.0 train set (the V1 split) | 167 |
| LATE-Media 1.0 train set | 42 |
| Total | 209 |
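For reference, the following is a hedged sketch of a standard Whisper fine-tuning setup with Seq2SeqTrainer, shown only on the Common Voice 17.0 Latvian train split. It is not the authors' modified script; the LATE-Media data is not included and the hyperparameters are illustrative.

```python
# Hedged sketch of a standard Whisper fine-tuning setup (not the authors' exact
# script). Common Voice 17.0 is gated on the Hub; accept its terms and log in first.
from dataclasses import dataclass

from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="latvian", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model.generation_config.language = "latvian"
model.generation_config.task = "transcribe"

cv = load_dataset("mozilla-foundation/common_voice_17_0", "lv", split="train")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    # Convert audio to log-Mel input features and transcripts to label token ids.
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare, remove_columns=cv.column_names)

@dataclass
class DataCollatorSpeechSeq2Seq:
    processor: WhisperProcessor
    decoder_start_token_id: int

    def __call__(self, features):
        # Pad log-Mel features and label token ids separately.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        # The tokenizer already prepends <|startoftranscript|>; drop it so the
        # model's label shifting does not duplicate it.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-lv",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=cv,
    data_collator=DataCollatorSpeechSeq2Seq(processor, model.config.decoder_start_token_id),
)
trainer.train()
```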

Evaluation

| Testing data | WER (%) | CER (%) |
|---|---|---|
| Latvian Common Voice 17.0 test set (V1), formatted | 5.0 | 1.6 |
| Latvian Common Voice 17.0 test set (V1), normalized | 3.4 | 1.0 |
| LATE-Media 1.0 test set, formatted | 20.8 | 8.2 |
| LATE-Media 1.0 test set, normalized | 14.1 | 5.9 |

The Latvian CV 17.0 test set is available here. The LATE-Media 1.0 test set is available here.
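The sketch below shows one way to compute WER/CER scores with the `evaluate` library. The normalization shown (lowercasing and punctuation stripping) is an assumption and may differ from the normalization used for the table above; the test items and file names are placeholders.

```python
# Hedged scoring sketch: transcribe test audio and compute WER/CER with `evaluate`.
# Test pairs (audio path, reference transcript) are placeholders.
import re

import evaluate
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="AiLab-IMCS-UL/whisper-large-v3-lv-late-cv17",
    device="cuda:0",
)

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize(text):
    # Assumed normalization: lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

test_items = [("clip_0001.wav", "reference transcript here")]  # placeholder

hyps = [asr(path, generate_kwargs={"language": "latvian"})["text"] for path, _ in test_items]
refs = [ref for _, ref in test_items]

print("formatted WER:", 100 * wer_metric.compute(predictions=hyps, references=refs))
print("normalized WER:", 100 * wer_metric.compute(
    predictions=[normalize(h) for h in hyps],
    references=[normalize(r) for r in refs],
))
print("normalized CER:", 100 * cer_metric.compute(
    predictions=[normalize(h) for h in hyps],
    references=[normalize(r) for r in refs],
))
```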

Citation

Please cite this paper if you use this model in your research:

@inproceedings{dargis-etal-2024-balsutalka-lv,
  author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
  title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
  booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  publisher = {ELRA and ICCL},
  year = {2024},
  pages = {2080--2085},
  url = {https://aclanthology.org/2024.lrec-main.187}
}

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project LATE (VPP-LETONIKA-2021/1-0006). We are grateful to all the participants of the national initiative BalsuTalka.lv for helping to make the Latvian Common Voice dataset much larger and more diverse.
