license: cc-by-nc-4.0
language:
- ab
- af
- am
- ar
- as
- az
- ba
- be
- bn
- bo
- bs
- br
- bg
- ca
- cs
- cv
- cy
- da
- de
- dv
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- tl
- fi
- fr
- fy
- ga
- gl
- gv
- gn
- gu
- ht
- ha
- he
- hi
- hr
- hu
- hy
- ig
- ia
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- km
- rw
- ky
- ku
- ko
- lo
- la
- lv
- ln
- lt
- lb
- lg
- ml
- mr
- mk
- mg
- mt
- mn
- mi
- ms
- my
- ne
- nl
- nn
- 'no'
- oc
- or
- pa
- pl
- pt
- ps
- ro
- ru
- sa
- si
- sl
- sk
- sn
- sd
- so
- st
- es
- sq
- sc
- sr
- su
- sw
- sv
- ta
- tt
- te
- tg
- th
- tn
- tk
- tr
- tw
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- yo
- zh
mHuBERT-147 models
mHuBERT-147 are compact and competitive multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
This repository contains:
- Fairseq checkpoint (original);
- HuggingFace checkpoint;
- Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
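
The factory string attached to the Faiss index packs several settings into one line. The sketch below only splits that string into named fields, using the standard library and not Faiss itself. The field names and the helper `describe_factory` are illustrative assumptions, and the reading of each token follows the general Faiss index-factory conventions (OPQ pre-transform, IVF coarse quantizer backed by HNSW, 4-bit fast-scan product quantizer with residual encoding) rather than anything stated in this repository.

```python
import re

# Factory string as listed above for the released Faiss index.
FACTORY = "OPQ16_64,IVF1000_HNSW32,PQ16x4fsr"


def describe_factory(spec: str) -> dict:
    """Split a three-stage Faiss factory string into its numeric settings.

    Hypothetical helper for illustration; it only handles strings shaped
    like the one above (OPQ pre-transform, IVF+HNSW, PQ code).
    """
    pre, coarse, code = spec.split(",")
    opq = re.fullmatch(r"OPQ(\d+)_(\d+)", pre)
    ivf = re.fullmatch(r"IVF(\d+)_HNSW(\d+)", coarse)
    pq = re.fullmatch(r"PQ(\d+)x(\d+)(fsr|fs)?", code)
    if not (opq and ivf and pq):
        raise ValueError(f"unsupported factory string: {spec}")
    return {
        "opq_subvectors": int(opq.group(1)),    # OPQ rotation blocks
        "opq_output_dim": int(opq.group(2)),    # dimension after the transform
        "ivf_lists": int(ivf.group(1)),         # number of inverted lists
        "hnsw_links": int(ivf.group(2)),        # HNSW links per node
        "pq_subquantizers": int(pq.group(1)),   # PQ sub-quantizers
        "pq_bits": int(pq.group(2)),            # bits per sub-quantizer
        "fast_scan": pq.group(3) is not None,   # "fs" suffix present
        "by_residual": pq.group(3) == "fsr",    # trailing "r" in the suffix
    }


info = describe_factory(FACTORY)
# Size of one stored PQ code: sub-quantizers * bits, converted to bytes.
code_bytes = info["pq_subquantizers"] * info["pq_bits"] // 8
print(info)
print(code_bytes)
```

Under this reading, each vector is reduced to 64 dimensions, assigned to one of 1000 inverted lists, and stored as an 8-byte code.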
Citing
@inproceedings{boito2024mhubert,
author={Marcely Zanon Boito and Vivek Iyer and Nikolaos Lagos and Laurent Besacier and Ioan Calapodescu},
title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
year=2024,
booktitle={Interspeech 2024},
}
Additional Information
Manifest list: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
Please note that there have been CommonVoice removal requests since training, so some of the listed files are no longer available.
Fairseq fork: https://github.com/utter-project/fairseq
Scripts for pre-processing/faiss clustering: https://github.com/utter-project/mHuBERT-147-scripts
Languages present but not indexed by Hugging Face: Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
Datasets Included
For ASR/ST/TTS datasets, only the train set is used.
- AISHELL and AISHELL-3
- BibleTTS
- ClovaCall
- CommonVoice v11
- Google TTS data: Javanese, Khmer, Nepali, Sundanese, South African Languages, Bengali Languages
- IISc-MILE: Tamil, Kannada
- Japanese Versatile Speech
- Kokoro
- Kosp2e
- Media Speech: Turkish only
- Multilingual LibriSpeech
- Samrómur
- THCHS-30 and THUYG-20
- VoxLingua107
- VoxPopuli
Funding
This work is an output of the European project UTTER (Unified Transcription and Translation for Extended Reality), funded under grant number 101070631. For more information, see https://he-utter.eu/