metadata
license: cc-by-nc-4.0
language:
  - ab
  - af
  - am
  - ar
  - as
  - az
  - ba
  - be
  - bn
  - bo
  - bs
  - br
  - bg
  - ca
  - cs
  - cv
  - cy
  - da
  - de
  - dv
  - el
  - en
  - eo
  - et
  - eu
  - ee
  - fo
  - fa
  - tl
  - fi
  - fr
  - fy
  - ga
  - gl
  - gv
  - gn
  - gu
  - ht
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - ig
  - ia
  - id
  - is
  - it
  - jv
  - ja
  - kn
  - ka
  - kk
  - km
  - rw
  - ky
  - ku
  - ko
  - lo
  - la
  - lv
  - ln
  - lt
  - lb
  - lg
  - ml
  - mr
  - mk
  - mg
  - mt
  - mn
  - mi
  - ms
  - my
  - ne
  - nl
  - nn
  - 'no'
  - oc
  - or
  - pa
  - pl
  - pt
  - ps
  - ro
  - ru
  - sa
  - si
  - sl
  - sk
  - sn
  - sd
  - so
  - st
  - es
  - sq
  - sc
  - sr
  - su
  - sw
  - sv
  - ta
  - tt
  - te
  - tg
  - th
  - tn
  - tk
  - tr
  - tw
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - yo
  - zh

mHuBERT-147 models

mHuBERT-147 are compact and competitive multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
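As a general-purpose speech encoder, the model can be used to extract frame-level features from raw audio. The following is a minimal sketch, not an official usage recipe. It assumes the HuggingFace checkpoint loads with the `HubertModel` class from `transformers`, that the repository id is `utter-project/mHuBERT-147`, and that the input is mono audio sampled at 16 kHz. The function name `extract_features` and both constants are hypothetical, and any input normalization the checkpoint may expect is not handled here.

```python
DEFAULT_REPO_ID = "utter-project/mHuBERT-147"  # assumption: repository id on the Hub
SAMPLE_RATE = 16_000  # assumption: expected input sampling rate in Hz


def extract_features(waveform, repo_id=DEFAULT_REPO_ID):
    """Return last-layer hidden states for a 1-D mono waveform tensor.

    `waveform` is assumed to be a float torch.Tensor of shape (samples,).
    """
    # Imported lazily so this module can be imported without torch/transformers.
    import torch
    from transformers import HubertModel

    model = HubertModel.from_pretrained(repo_id)  # downloads weights on first call
    model.eval()
    with torch.no_grad():
        # HubertModel expects input_values shaped (batch, samples).
        outputs = model(waveform.unsqueeze(0))
    # Shape: (1, frames, hidden_size)
    return outputs.last_hidden_state
```

A call such as `extract_features(torch.zeros(SAMPLE_RATE))` would run the encoder on one second of silence. For real use, load the audio with a library of your choice and resample it first.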

This repository contains:

  • Fairseq checkpoint (original);
  • HuggingFace checkpoint;
  • Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
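The string in parentheses after the Faiss index follows Faiss's index-factory syntax. The sketch below splits it into its three stages to make the configuration easier to read. It is only an illustration: the helper `describe_factory` and the dictionary keys are hypothetical names, the parser handles only this particular shape of factory string, and the component meanings are taken from the Faiss index-factory documentation rather than from this repository.

```python
import re

FACTORY = "OPQ16_64,IVF1000_HNSW32,PQ16x4fsr"


def describe_factory(factory: str) -> dict:
    """Split an 'OPQ..,IVF.._HNSW..,PQ..' factory string into named parts."""
    pre, coarse, code = factory.split(",")
    opq = re.fullmatch(r"OPQ(\d+)_(\d+)", pre)
    ivf = re.fullmatch(r"IVF(\d+)_HNSW(\d+)", coarse)
    pq = re.fullmatch(r"PQ(\d+)x(\d+)(fsr|fs)?", code)
    if not (opq and ivf and pq):
        raise ValueError(f"unsupported factory string: {factory!r}")
    suffix = pq[3]  # None, "fs", or "fsr"
    return {
        # OPQ pre-transform: rotation tuned for 16 sub-vectors, output dim 64
        "opq_subvectors": int(opq[1]),
        "opq_output_dim": int(opq[2]),
        # Inverted file with 1000 centroids, HNSW (32 links) as coarse quantizer
        "ivf_centroids": int(ivf[1]),
        "hnsw_links": int(ivf[2]),
        # Product quantizer: 16 sub-quantizers of 4 bits each
        "pq_subquantizers": int(pq[1]),
        "pq_bits": int(pq[2]),
        # "fs" selects the fast-scan PQ; "fsr" also encodes residuals
        "fast_scan": suffix in ("fs", "fsr"),
        "encode_residual": suffix == "fsr",
        # Size of each stored code in bits
        "code_bits": int(pq[1]) * int(pq[2]),
    }


parts = describe_factory(FACTORY)
```

With Faiss installed, a string like this is normally passed to `faiss.index_factory(d, FACTORY)` to build an untrained index of input dimension `d`, while an already trained index file is read with `faiss.read_index`.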

Citing

@inproceedings{boito2024mhubert,
author={Marcely Zanon Boito and Vivek Iyer and Nikolaos Lagos and Laurent Besacier and Ioan Calapodescu},
title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
year=2024,
booktitle={Interspeech 2024},
}

Additional Information

Manifest list: https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest

Please note that CommonVoice has received removal requests since the model was trained, so some of the listed files are no longer available.

Fairseq fork: https://github.com/utter-project/fairseq

Scripts for pre-processing/faiss clustering: https://github.com/utter-project/mHuBERT-147-scripts

Languages present but not indexed by Hugging Face: Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).

Datasets Included

For ASR/ST/TTS datasets, only the train set is used.

Funding

This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality), funded under grant number 101070631. For more information, see https://he-utter.eu/