mHuBERT-147 is a family of compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
Different from *traditional* HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units.
Training employs two-level up-sampling over languages and data sources. See more information in [our paper](https://arxiv.org/pdf/2406.06371).

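For quick feature extraction, the checkpoint can be loaded through HuggingFace `transformers`. A minimal sketch, assuming the hosted checkpoint is compatible with `HubertModel` (the model id and the placeholder audio below are illustrative):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Assumed model id; point this at the checkpoint you want to load.
MODEL_ID = "utter-project/mHuBERT-147"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = HubertModel.from_pretrained(MODEL_ID).eval()

# Placeholder input: 1 second of silence at 16 kHz; replace with real audio samples.
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, n_frames, 768) for a base model
```
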
**This repository contains:**
* Fairseq checkpoint (original);
* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr); see the unit-assignment sketch below.

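The sketch below shows how a faiss index of this kind can map frame-level HuBERT features to discrete units via nearest-centroid search. The index filename and the 768-dim features are assumptions for illustration; the exact label-extraction procedure lives in the pre-processing scripts repository linked under Training:

```python
import faiss
import numpy as np

# Hypothetical filename; check this repository's file list for the actual index file.
index = faiss.read_index("mhubert147_faiss.index")

# Features from an intermediate HuBERT layer: float32, one row per frame
# (768 dimensions assumed for a base-sized model).
features = np.random.rand(100, 768).astype("float32")

# Nearest-neighbour search: the id of the closest indexed centroid serves as
# the discrete speech unit for each frame.
_, units = index.search(features, 1)
units = units.squeeze(1)
```
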
**Related Models:**
* [Second Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter)
* [First Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter)
* [CommonVoice Prototype (12 languages)](https://huggingface.co/utter-project/hutter-12-3rd-base)

# Training

* **[Manifest list available here.](https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest)** Please note that CommonVoice removal requests were issued after training, so some of the listed files are no longer available.

* **[Fairseq fork](https://github.com/utter-project/fairseq)** contains the scripts for training with multilingual batching and two-level up-sampling; see the sketch after this list.

* **[Scripts for pre-processing/faiss clustering available here.](https://github.com/utter-project/mHuBERT-147-scripts)**

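The two-level up-sampling can be pictured as temperature-based re-weighting applied first across languages and then across data sources within the chosen language. A minimal sketch, assuming the common p_i ∝ n_i^α formulation (the exact weighting and its parameters are defined in the paper and implemented in the fairseq fork; all numbers below are illustrative):

```python
import numpy as np

def upsampling_probs(hours, alpha=0.5):
    """Temperature up-sampling: p_i proportional to hours_i ** alpha.

    alpha < 1 flattens the distribution, boosting low-resource entries.
    (This alpha is illustrative, not the paper's setting.)
    """
    h = np.asarray(hours, dtype=np.float64)
    p = h**alpha
    return p / p.sum()

rng = np.random.default_rng(0)

# Level 1: choose a language according to up-sampled hour counts.
lang_hours = {"en": 30000.0, "sw": 500.0, "gn": 5.0}  # illustrative numbers
language = rng.choice(list(lang_hours), p=upsampling_probs(list(lang_hours.values())))

# Level 2: within the chosen language, choose a data source the same way.
source_hours = {"commonvoice": 300.0, "librivox": 150.0, "other": 50.0}  # illustrative
source = rng.choice(list(source_hours), p=upsampling_probs(list(source_hours.values())))
```
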
# ML-SUPERB Scores

mHuBERT-147 reaches second and first position on the 10min and 1h leaderboards, respectively. We achieve new SOTA scores for three LID tasks.
See more information in [our paper](https://arxiv.org/pdf/2406.06371).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/chXjExnWc3rhhtdsyiU-W.png)