utter-project/mHuBERT-147

mzboito committed on May 17

Commit c3abc14 • 1 Parent(s): 9478075

Update README.md

Files changed (1)

README.md +15 -6

@@ -125,7 +125,7 @@ language:
 ## mHuBERT-147 models
-mHuBERT-147 are multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
 This repository contains:
 * Fairseq checkpoint (original);
@@ -135,17 +135,26 @@ This repository contains:
 # Citing
-```
 [PAPER GOES HERE]
-'''
-# Other information
-**Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
 **Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
-**Datasets:** For ASR/ST/TTS datasets, only train set is used.
 * [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
 * [BibleTTS](https://www.openslr.org/129/)
 * [ClovaCall](https://github.com/clovaai/ClovaCall)

 ## mHuBERT-147 models
+mHuBERT-147 are compact and competitive multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
 This repository contains:
 * Fairseq checkpoint (original);
 # Citing
 [PAPER GOES HERE]
+# Additional Information
 **Manifest list:** https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest
+Please note that since training, there were CommonVoice removal requests. This means that some of the listed files are no longer available.
+**Fairseq fork:** https://github.com/utter-project/fairseq
+**Scripts for pre-processing/faiss clustering:** https://github.com/utter-project/mHuBERT-147-scripts
+**Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
+# Datasets Included
+For ASR/ST/TTS datasets, only train set is used.
 * [Aishell](https://www.openslr.org/33/) and [AISHELL-3](https://www.openslr.org/93/)
 * [BibleTTS](https://www.openslr.org/129/)
 * [ClovaCall](https://github.com/clovaai/ClovaCall)