wav2vec2-xls-r-slavic-pomak

Pomak is an endangered South East Slavic language variety spoken in Northern Greece. This is the first automatic speech recognition (ASR) model for Pomak. To train the model, we fine-tuned a Slavic model (classla/wav2vec2-large-slavic-parlaspeech-hr) on 11h of recorded Pomak speech.
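
For readers who want to try the model, the following is a minimal inference sketch. It assumes the checkpoint is loaded through the `transformers` automatic-speech-recognition pipeline; the function name `transcribe` is hypothetical, and both the model ID and the audio path are supplied by the caller rather than taken from this card.

```python
def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a wav2vec2 CTC checkpoint.

    `model_id` is the Hub ID or local path of the fine-tuned model
    (not hard-coded here, since this card does not spell it out).
    """
    # Imported inside the function so that defining it does not
    # require transformers to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # chunk_length_s makes the pipeline process long recordings in
    # windows; 25 s mirrors the maximum segment length used for training.
    result = asr(audio_path, chunk_length_s=25)
    return result["text"]
```

Calling `transcribe("recording.wav", model_id)` with a placeholder file name returns the decoded text. Decoding a file path relies on ffmpeg being available to the pipeline.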

Recordings

Four native Pomak speakers (two female, two male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a corpus of about 14h.

Speaker	Gender	Total recorded duration
NK9dIF	F	4h 44m 45s
xoVY9q	M	4h 36m 12s
9G75fk	F	1h 44m 03s
n5WzHj	M	3h 44m 04s
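
As a quick sanity check, the per-speaker durations in the table can be summed with the standard library. This is only an illustrative sketch; the helper name `fmt` is not part of any released code.

```python
from datetime import timedelta

# Per-speaker durations copied from the table above.
recorded = {
    "NK9dIF": timedelta(hours=4, minutes=44, seconds=45),
    "xoVY9q": timedelta(hours=4, minutes=36, seconds=12),
    "9G75fk": timedelta(hours=1, minutes=44, seconds=3),
    "n5WzHj": timedelta(hours=3, minutes=44, seconds=4),
}

# Start the sum from an empty timedelta instead of the default integer 0.
total = sum(recorded.values(), timedelta())


def fmt(td: timedelta) -> str:
    """Format a duration in the same 'Xh YYm ZZs' style as the table."""
    hours, rem = divmod(int(td.total_seconds()), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


print(fmt(total))  # 14h 49m 04s
```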

To fine-tune the model, we split the long recordings into segments of at most 25 seconds each. This removed most pauses and reduced the total dataset duration to 11h 8m.
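
One way such a segmentation step could work is sketched below. It is not the pipeline described in the paper: it assumes that speech intervals have already been detected (for example by a voice activity detector), and the function name `pack_segments` and the `max_pause` threshold are hypothetical. Consecutive speech intervals are merged greedily as long as the segment stays within 25 seconds and the pause between them is short; longer pauses fall between segments and are dropped.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def pack_segments(
    speech: List[Interval],
    max_len: float = 25.0,   # maximum segment length, as stated above
    max_pause: float = 1.0,  # assumed threshold; longer pauses are cut out
) -> List[Interval]:
    """Greedily merge speech intervals into segments of at most max_len."""
    segments: List[Interval] = []
    for start, end in sorted(speech):
        # Hard-split a single stretch of speech that is itself too long.
        while end - start > max_len:
            segments.append((start, start + max_len))
            start += max_len
        if segments:
            seg_start, seg_end = segments[-1]
            fits = end - seg_start <= max_len
            close = start - seg_end <= max_pause
            if fits and close:
                # Extend the current segment across the short pause.
                segments[-1] = (seg_start, end)
                continue
        segments.append((start, end))
    return segments


# Made-up intervals for illustration only.
speech = [(0.0, 9.0), (9.5, 18.0), (18.4, 27.0), (31.0, 40.0)]
print(pack_segments(speech))
# [(0.0, 18.0), (18.4, 27.0), (31.0, 40.0)]
```

In this toy input, the 4-second pause between 27.0 and 31.0 is not covered by any segment, which is how pause removal shortens the dataset.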

Metrics

We evaluated the model on the test split, which consists of 10% of the dataset recordings.

Model	WER	CER
pre-trained	87.31%	31.47%
fine-tuned	9.06%	3.12%
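
For reference, word error rate (WER) and character error rate (CER) are both edit distances normalised by the reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration of those definitions, not the evaluation script used for the table; the example strings are English placeholders.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(ref, hyp) / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on the mat"
print(f"WER: {wer(reference, hypothesis):.2%}")  # WER: 16.67%
print(f"CER: {cer(reference, hypothesis):.2%}")  # CER: 4.55%
```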

Training hyperparameters

We fine-tuned the baseline model (wav2vec2-large-slavic-parlaspeech-hr) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

arg	value
`per_device_train_batch_size`	8
`gradient_accumulation_steps`	2
`num_train_epochs`	35
`learning_rate`	3e-4
`warmup_steps`	500
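
The argument names in the table match parameters of `transformers.TrainingArguments`. The sketch below collects them in a dictionary and derives the effective batch size, assuming training ran on a single GPU as the text above suggests; `num_gpus` and the `output_dir` value in the comment are placeholders.

```python
# Hyperparameters copied from the table above.
hparams = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}

num_gpus = 1  # assumption: one RTX 3090

# Number of examples that contribute to each optimizer step.
effective_batch_size = (
    hparams["per_device_train_batch_size"]
    * hparams["gradient_accumulation_steps"]
    * num_gpus
)
print(effective_batch_size)  # 16

# With transformers installed, the same values can be passed on directly:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **hparams)
```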

Citation

To cite this work or read more about the training pipeline, see the following paper (https://aclanthology.org/2023.fieldmatters-1.5):

@inproceedings{tsoukala-etal-2023-asr,
    title = "{ASR} pipeline for low-resourced languages: A case study on Pomak",
    author = "Tsoukala, Chara  and
      Kritsis, Kosmas  and
      Douros, Ioannis  and
      Katsamanis, Athanasios  and
      Kokkas, Nikolaos  and
      Arampatzakis, Vasileios  and
      Sevetlidis, Vasileios  and
      Markantonatou, Stella  and
      Pavlidis, George",
    booktitle = "Proceedings of the Second Workshop on NLP Applications to Field Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.fieldmatters-1.5",
    doi = "10.18653/v1/2023.fieldmatters-1.5",
    pages = "40--45",
    abstract = "Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.",
}