mHuBERT-147 is a family of compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
Different from *traditional* HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units.
Training employs two-level up-sampling over languages and data sources. See more information in [our paper](https://arxiv.org/pdf/2406.06371).

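For quick feature extraction, the checkpoint can be loaded through HuggingFace `transformers`. A minimal sketch, assuming the hosted checkpoint is compatible with `HubertModel` (the model id and the placeholder audio below are illustrative):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Assumed model id; point this at the checkpoint you want to load.
MODEL_ID = "utter-project/mHuBERT-147"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = HubertModel.from_pretrained(MODEL_ID).eval()

# Placeholder input: 1 second of silence at 16 kHz; replace with real audio samples.
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, n_frames, 768) for a base model
```
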
**This repository contains:**
* Fairseq checkpoint (original);
* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr); see the unit-assignment sketch below.

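The sketch below shows how a faiss index of this kind can map frame-level HuBERT features to discrete units via nearest-centroid search. The index filename and the 768-dim features are assumptions for illustration; the exact label-extraction procedure lives in the pre-processing scripts repository linked under Training:

```python
import faiss
import numpy as np

# Hypothetical filename; check this repository's file list for the actual index file.
index = faiss.read_index("mhubert147_faiss.index")

# Features from an intermediate HuBERT layer: float32, one row per frame
# (768 dimensions assumed for a base-sized model).
features = np.random.rand(100, 768).astype("float32")

# Nearest-neighbour search: the id of the closest indexed centroid serves as
# the discrete speech unit for each frame.
_, units = index.search(features, 1)
units = units.squeeze(1)
```
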
**Related Models:**
* [Second Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter)
* [First Iteration repository](https://huggingface.co/utter-project/mHuBERT-147-base-1st-iter)
* [CommonVoice Prototype (12 languages)](https://huggingface.co/utter-project/hutter-12-3rd-base)

# Training

* **[Manifest list available here.](https://huggingface.co/utter-project/mHuBERT-147-base-3rd-iter/tree/main/manifest)** Please note that CommonVoice removal requests were issued after training, so some of the listed files are no longer available.

* **[Fairseq fork](https://github.com/utter-project/fairseq)** contains the scripts for training with multilingual batching and two-level up-sampling; see the sketch after this list.

* **[Scripts for pre-processing/faiss clustering available here.](https://github.com/utter-project/mHuBERT-147-scripts)**

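The two-level up-sampling can be pictured as temperature-based re-weighting applied first across languages and then across data sources within the chosen language. A minimal sketch, assuming the common p_i ∝ n_i^α formulation (the exact weighting and its parameters are defined in the paper and implemented in the fairseq fork; all numbers below are illustrative):

```python
import numpy as np

def upsampling_probs(hours, alpha=0.5):
    """Temperature up-sampling: p_i proportional to hours_i ** alpha.

    alpha < 1 flattens the distribution, boosting low-resource entries.
    (This alpha is illustrative, not the paper's setting.)
    """
    h = np.asarray(hours, dtype=np.float64)
    p = h**alpha
    return p / p.sum()

rng = np.random.default_rng(0)

# Level 1: choose a language according to up-sampled hour counts.
lang_hours = {"en": 30000.0, "sw": 500.0, "gn": 5.0}  # illustrative numbers
language = rng.choice(list(lang_hours), p=upsampling_probs(list(lang_hours.values())))

# Level 2: within the chosen language, choose a data source the same way.
source_hours = {"commonvoice": 300.0, "librivox": 150.0, "other": 50.0}  # illustrative
source = rng.choice(list(source_hours), p=upsampling_probs(list(source_hours.values())))
```
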
# ML-SUPERB Scores

mHuBERT-147 reaches second and first position on the 10min and 1h leaderboards, respectively. We achieve new SOTA scores for three LID tasks.
See more information in [our paper](https://arxiv.org/pdf/2406.06371).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62262e19d36494a6f743a28d/chXjExnWc3rhhtdsyiU-W.png)