---
license: cc-by-nc-4.0
tags:
- speech processing
- self-supervision
- african languages
---

## Model description

This self-supervised speech model (a.k.a. SSA-HuBERT-base-60k) is based on the HuBERT Base architecture (~95M parameters) [1]. It was trained on nearly 60,000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.

### Pretraining data

- Dataset: The training dataset is composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).
- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

## ASR fine-tuning

The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model. Fine-tuning is done separately for each language using the FLEURS dataset [2]. The pretrained model (SSA-HuBERT-base-60k) is used as a speech encoder and is fully fine-tuned together with two linear layers of 1024 units and a softmax output layer on top.

## License

This model is released under the CC BY-NC 4.0 license.

## Publication

This model was presented at AfricaNLP 2024. The associated paper is available here: [Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context](https://openreview.net/forum?id=zLOhcft2E7)

### Citation

Please cite our paper when using the SSA-HuBERT-base-60k model:

Caubrière, A., & Gauthier, E. (2024). Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context. In 5th Workshop on African Natural Language Processing (AfricaNLP 2024).


**BibTeX citation:**

```
@inproceedings{caubri{\`e}re2024ssaspeechssl,
  title={Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context},
  author={Antoine Caubri{\`e}re and Elodie Gauthier},
  booktitle={5th Workshop on African Natural Language Processing},
  year={2024},
  url={https://openreview.net/forum?id=zLOhcft2E7}
}
```

## Results

The following results are obtained with greedy decoding **(no language model rescoring)**. Character error rates (CERs) and word error rates (WERs), in percent, are given in the table below for the 20 languages of the Sub-Saharan Africa (SSA) subset of the FLEURS dataset.

| **Language** | **CER** | **CER (joint finetuning)** | **WER** | **WER (joint finetuning)** |
| :----------------- | :--------- | :--------- | :--------- | :--------- |
| **Afrikaans** | 23.3 | 20.3 | 68.4 | 62.6 |
| **Amharic** | 15.9 | 14.9 | 52.7 | 49.0 |
| **Fula** | 21.2 | 17.8 | 61.9 | 56.4 |
| **Ganda** | 11.5 | 10.7 | 52.8 | 50.3 |
| **Hausa** | 10.5 | 9.0 | 32.5 | 29.4 |
| **Igbo** | 19.7 | 17.2 | 57.5 | 52.9 |
| **Kamba** | 16.1 | 15.6 | 53.9 | 53.7 |
| **Lingala** | 8.7 | 6.9 | 24.7 | 20.9 |
| **Luo** | 9.9 | 8.2 | 38.9 | 34.9 |
| **Northern-Sotho** | 13.5 | 11.7 | 43.2 | 38.9 |
| **Nyanja** | 13.3 | 10.9 | 54.2 | 48.3 |
| **Oromo** | 22.8 | 20.1 | 78.1 | 74.8 |
| **Shona** | 11.6 | 8.3 | 50.2 | 39.3 |
| **Somali** | 21.6 | 19.7 | 64.9 | 60.3 |
| **Swahili** | 7.1 | 5.5 | 23.8 | 20.3 |
| **Umbundu** | 21.7 | 18.8 | 61.7 | 54.2 |
| **Wolof** | 19.4 | 17.0 | 55.0 | 50.7 |
| **Xhosa** | 11.9 | 9.9 | 51.6 | 45.9 |
| **Yoruba** | 24.3 | 23.5 | 67.5 | 65.7 |
| **Zulu** | 12.2 | 9.6 | 53.4 | 44.9 |
| *Overall average* | *15.8* | *13.8* | *52.3* | *47.7* |

## Reproducibility

We provide a notebook to reproduce the ASR experiments described in our paper. See `SB_ASR_FLEURS_finetuning.ipynb`. Using the `ASR_FLEURS-swahili_hf.yaml` config file, you can run the recipe on Swahili.

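When comparing reproduced scores against the Results table, it can help to recall how CER and WER are defined: the edit distance between hypothesis and reference, divided by the reference length. The following is a minimal standard-library sketch of that definition, not the scoring code used in the notebook or in SpeechBrain. The function names (`edit_distance`, `error_rate`, `wer`, `cer`) and the example sentences are hypothetical, and it assumes that spaces count as characters for CER, which may differ from the convention used for the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by reference length, in percent."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters (spaces included, by assumption)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    ref = "habari ya asubuhi"
    hyp = "habari za asubuhi"
    print(f"WER: {wer(ref, hyp):.1f}")
    print(f"CER: {cer(ref, hyp):.1f}")
```

In this example a single substituted character changes one word out of three, so the WER is much higher than the CER. The same effect is visible in the Results table, where WERs are consistently several times larger than CERs.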

## References

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.

[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.