SSA-HuBERT-base-5k / README.md
metadata
license: cc-by-nc-4.0
tags:
  - speech processing
  - self-supervision
  - african languages

Model description

This self-supervised speech model (a.k.a. SSA-HuBERT-base-5k) is based on a HuBERT Base architecture (~95M params) [1].
It was trained on nearly 5,000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.
Compared to SSA-HuBERT-base-60k, it is a version balanced in terms of gender and language representation.

Pretraining data

  • Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).

  • Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

License

This model is released under the CC BY-NC 4.0 license.

Publication

This model was presented at JEP-TALN 2024. The associated paper is available here: Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context

Citation

Please cite our paper when using the SSA-HuBERT-base-5k model:

Antoine Caubrière, Elodie Gauthier. Représentation de la parole multilingue par apprentissage auto-supervisé dans un contexte subsaharien. 35èmes Journées d'Études sur la Parole (JEP 2024) 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024), Jul 2024, Toulouse, France. pp.163-172. ⟨hal-04623069⟩

Bibtex:

@inproceedings{caubriere:hal-04623069,    
  TITLE = {{Repr{\'e}sentation de la parole multilingue par apprentissage auto-supervis{\'e} dans un contexte subsaharien}},    
  AUTHOR = {Caubri{\`e}re, Antoine and Gauthier, Elodie},    
  URL = {https://inria.hal.science/hal-04623069},    
  BOOKTITLE = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},    
  ADDRESS = {Toulouse, France},    
  EDITOR = {Balaguer, Mathieu and Bendahman, Nihed and Ho-Dac, Lydia-Mai and Mauclair, Julie and Moreno, Jose G and Pinquier, Julien},
  PUBLISHER = {{ATALA \& AFPC}},    
  PAGES = {163-172},    
  YEAR = {2024},    
  MONTH = Jul,    
  KEYWORDS = {Apprentissage auto-supervis{\'e} ; Langues subsaharienne ; Reconnaissance de la parole multilingue ; HuBERT},    
  PDF = {https://inria.hal.science/hal-04623069/file/4347.pdf},    
  HAL_ID = {hal-04623069},    
  HAL_VERSION = {v1},    
}

ASR fine-tuning

The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model.
Fine-tuning is done for each language using the FLEURS dataset [2].
The pretrained model (SSA-HuBERT-base-5k) is used as a speech encoder and is fully fine-tuned, with two 1024-unit linear layers and a softmax output layer on top.
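To give a rough sense of how small this head is relative to the encoder, the sketch below counts its trainable parameters. It rests on several assumptions that the card does not state: the encoder output size is 768 (the usual HuBERT Base hidden size), a final projection maps the second 1024-unit layer to the output vocabulary before the softmax, all layers have biases, and the vocabulary holds a hypothetical 40 symbols. Any normalization or dropout layers in the actual recipe are ignored.

```python
# Parameter count for the fine-tuning head described above.
# ENCODER_DIM is the standard HuBERT Base hidden size; VOCAB_SIZE and the
# final projection layer are assumptions, not values taken from the card.
ENCODER_DIM = 768
HIDDEN_DIM = 1024   # from the card: two 1024-unit linear layers
VOCAB_SIZE = 40     # hypothetical character vocabulary size


def linear_params(n_in, n_out, bias=True):
    """Number of weights (and optional biases) in one linear layer."""
    return n_in * n_out + (n_out if bias else 0)


def head_params(encoder_dim=ENCODER_DIM, hidden_dim=HIDDEN_DIM,
                vocab_size=VOCAB_SIZE):
    """Total parameters: two hidden layers plus the assumed output projection."""
    shapes = [
        (encoder_dim, hidden_dim),   # first 1024-unit layer
        (hidden_dim, hidden_dim),    # second 1024-unit layer
        (hidden_dim, vocab_size),    # assumed projection feeding the softmax
    ]
    return sum(linear_params(n_in, n_out) for n_in, n_out in shapes)


if __name__ == "__main__":
    print(f"Head parameters: {head_params():,}")
```

Under these assumptions the head adds roughly 1.9M parameters, about 2% of the ~95M-parameter encoder, so nearly all of the capacity being fine-tuned sits in the pretrained model itself.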

Results

The following results are obtained with greedy decoding (no language model rescoring).
Character error rates (CERs) and Word error rates (WERs) are given in the table below, on the 20 languages of the SSA subpart of the FLEURS dataset.

| Language | CER (%) | CER (%), joint fine-tuning | WER (%) | WER (%), joint fine-tuning |
|---|---|---|---|---|
| Afrikaans | 23.8 | 20.3 | 68.3 | 61.1 |
| Amharic | 15.5 | 14.3 | 51.4 | 47.6 |
| Fula | 21.2 | 17.7 | 60.6 | 55.4 |
| Ganda | 11.7 | 11.1 | 53.3 | 52.0 |
| Hausa | 11.2 | 10.1 | 35.6 | 33.8 |
| Igbo | 20.9 | 17.2 | 57.9 | 52.4 |
| Kamba | 16.3 | 15.9 | 53.7 | 54.3 |
| Lingala | 8.7 | 7.4 | 24.2 | 21.4 |
| Luo | 10.2 | 8.4 | 38.5 | 34.3 |
| Northern-Sotho | 14.4 | 11.5 | 44.6 | 38.6 |
| Nyanja | 13.7 | 11.3 | 54.5 | 48.1 |
| Oromo | 22.9 | 21.2 | 77.4 | 74.3 |
| Shona | 11.2 | 8.7 | 48.2 | 39.7 |
| Somali | 21.9 | 20.0 | 64.5 | 60.6 |
| Swahili | 8.6 | 6.5 | 28.8 | 24.4 |
| Umbundu | 21.7 | 20.7 | 60.8 | 57.0 |
| Wolof | 19.2 | 17.3 | 54.2 | 50.1 |
| Xhosa | 12.4 | 10.1 | 52.3 | 47.1 |
| Yoruba | 25.0 | 23.8 | 68.0 | 66.6 |
| Zulu | 12.4 | 10.0 | 53.0 | 46.1 |
| Overall average | 16.1 | 14.1 | 52.5 | 48.2 |
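For readers unfamiliar with these metrics, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, summed over the test set and divided by the total reference length. It is not the scoring code used to produce the table above. The function names are illustrative, and counting spaces as characters in the CER is an assumption, since conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent.

    unit="word" gives WER, unit="char" gives CER
    (spaces are counted as characters here, which is an assumption).
    """
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return 100.0 * total_edits / total_length


if __name__ == "__main__":
    refs = ["habari ya asubuhi"]
    hyps = ["habari za asubuhi"]
    print(f"WER: {error_rate(refs, hyps, unit='word'):.1f}")
    print(f"CER: {error_rate(refs, hyps, unit='char'):.1f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically several times higher than CER on the same output, which matches the pattern in the table.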

Reproducibility

We provide a notebook to reproduce the ASR experiments described in our paper. See SB_ASR_FLEURS_finetuning.ipynb.
Using the ASR_FLEURS-swahili_hf.yaml config file, you can run the recipe on Swahili.

References

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.
[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.