Commit 308b176 (parent: 7afbcaa): Update README

README.md (changed):
license: cc-by-4.0
---

## ESPnet2 Spoken Language Identification (LID) model

### `espnet/geolid_combined_shared_trainable`

This geolocation-aware language identification (LID) model was developed using the [ESPnet](https://github.com/espnet/espnet/) toolkit. It integrates the pretrained MMS-1B model ([facebook/mms-1b](https://huggingface.co/facebook/mms-1b)) as the encoder and ECAPA-TDNN ([arXiv](https://arxiv.org/pdf/2005.07143)) as the embedding extractor for robust spoken language identification.

The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy substantially improves robustness, especially for dialects and accented speech.
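One rough way to read innovation (1) above is as multi-task training in which geolocation prediction contributes an auxiliary loss alongside language classification. The decomposition and the weight \\( \lambda \\) below are illustrative assumptions for intuition only, not the exact formulation from the paper:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LID}} + \lambda \, \mathcal{L}_{\text{geo}}
$$

Here \\( \mathcal{L}_{\text{LID}} \\) denotes the spoken language classification loss and \\( \mathcal{L}_{\text{geo}} \\) the auxiliary geolocation prediction loss.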
For further details on the geolocation-aware LID methodology, please refer to our paper: *Geolocation-Aware Robust Spoken Language Identification* (arXiv link to be added).

### Usage Guide: How to use in ESPnet2

#### Prerequisites

First, ensure you have ESPnet installed. If not, follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html).
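If you are starting from a clean environment, the following sketch covers the checkout that the Quick Start below assumes; the repository URL is the official ESPnet repo, and the editable install mirrors the command used in the Quick Start:

```bash
# Get the ESPnet source tree (skip if you already have a checkout)
git clone https://github.com/espnet/espnet.git
cd espnet

# Install ESPnet in editable mode into the active Python environment
pip install -e .
```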
#### Quick Start

Run the following commands to set up and use the pre-trained model:

```bash
cd espnet

pip install -e .
cd egs2/geolid/lid1

# Download the exp_combined to egs2/geolid/lid1
./run_combined.sh --skip_data_prep false --skip_train true
```

This will download the pre-trained model and run inference on the VoxLingua107 test data.
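If you need to fetch the pretrained experiment directory manually (for example, to inspect `exp_combined` before running the recipe), a sketch using the Hugging Face Hub CLI is shown below. The `--exclude` pattern is illustrative, meant to avoid overwriting local files such as this card; adjust it as needed:

```bash
# Run from egs2/geolid/lid1: download this repo's files (exp_combined, config, etc.)
# into the current directory. Requires the `hf` CLI from a recent huggingface_hub;
# `huggingface-cli download` accepts the same arguments on older versions.
# The --exclude pattern here is illustrative.
hf download espnet/geolid_combined_shared_trainable --local-dir . --exclude "README.md"
```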
### Train and Evaluation Datasets

Training used a combined dataset merging five domain-specific corpora, for a total of 9,865 hours of speech covering 157 languages.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (Combined) |
| ------------- | ----------- | ------------------ | ------- | ------------------------- |
| [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) | YouTube | 107/33 | No | Seen |
| [Babel](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=31a13cefb42647e924e0d2778d341decc44c40e9) | Telephone | 25/25 | No | Seen |
| [FLEURS](https://huggingface.co/datasets/google/xtreme_s) | Read speech | 102/102 | No | Seen |
| [ML-SUPERB 2.0](https://huggingface.co/datasets/espnet/ml_superb_hf) | Mixed | 137/(137, 8) | Yes | Seen |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Parliament | 16/16 | No | Seen |
### Results

**Accuracy (%) on In-domain and Out-of-domain Test Sets**

<style>
.hf-model-cell {
  max-width: 120px;
  overflow-x: auto;
  white-space: nowrap;
  scrollbar-width: thin;
  scrollbar-color: #888 #f1f1f1;
}

.config-cell {
  max-width: 100px;
  overflow-x: auto;
  white-space: nowrap;
  scrollbar-width: thin;
  scrollbar-color: #888 #f1f1f1;
}

.hf-model-cell::-webkit-scrollbar,
.config-cell::-webkit-scrollbar {
  height: 6px;
}

.hf-model-cell::-webkit-scrollbar-track,
.config-cell::-webkit-scrollbar-track {
  background: #f1f1f1;
  border-radius: 3px;
}

.hf-model-cell::-webkit-scrollbar-thumb,
.config-cell::-webkit-scrollbar-thumb {
  background: #888;
  border-radius: 3px;
}

.hf-model-cell::-webkit-scrollbar-thumb:hover,
.config-cell::-webkit-scrollbar-thumb:hover {
  background: #555;
}
</style>

<div style="overflow-x: auto;">

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| <div class="hf-model-cell">[egs2/geolid/lid1](https://github.com/espnet/espnet/tree/master/egs2/geolid/lid1)</div> | <div class="config-cell">`conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable.yaml`</div> | 94.4 | 95.4 | 97.7 | 88.6 | 86.8 | 99.0 | 93.7 |

</div>

## LID config
### Citation

```bibtex
@inproceedings{wang2025geolid,
  author={Qingzheng Wang and Hye-jin Shim and Jiancheng Sun and Shinji Watanabe},
  title={Geolocation-Aware Robust Spoken Language Identification},
  year={2025},
  booktitle={Proceedings of ASRU},
}

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```