qingzhengwang committed
Commit 308b176 · 1 Parent(s): 7afbcaa

Update README

Files changed (1)
  1. README.md +87 -26
README.md CHANGED
@@ -9,22 +9,31 @@ datasets:
  license: cc-by-4.0
  ---

- ## ESPnet2 LID model

  ### `espnet/geolid_combined_shared_trainable`

- This model was trained by Qingzheng-Wang using geolid recipe in [espnet](https://github.com/espnet/espnet/).

- ### Demo: How to use in ESPnet2

- Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
- if you haven't done that already.

  ```bash
  cd espnet

  pip install -e .
-
  cd egs2/geolid/lid1

  # Download the exp_combined to egs2/geolid/lid1
@@ -33,7 +42,71 @@ hf download espnet/geolid_combined_shared_trainable --local-dir . --exclude "REA
  ./run_combined.sh --skip_data_prep false --skip_train true
  ```

  ## LID config

@@ -285,9 +358,16 @@ distributed: false

- ### Citing ESPnet

  ```BibTex
  @inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
@@ -297,23 +377,4 @@ distributed: false
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
  }
-
-
-
-
-
-
- ```
-
- or arXiv:
-
- ```bibtex
- @misc{watanabe2018espnet,
- title={ESPnet: End-to-End Speech Processing Toolkit},
- author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
- year={2018},
- eprint={1804.00015},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
- }
  ```
 
  license: cc-by-4.0
  ---

+ ## ESPnet2 Spoken Language Identification (LID) model

  ### `espnet/geolid_combined_shared_trainable`

+ This geolocation-aware language identification (LID) model was developed with the [ESPnet](https://github.com/espnet/espnet/) toolkit. It uses the pretrained MMS-1B model ([facebook/mms-1b](https://huggingface.co/facebook/mms-1b)) as the encoder and ECAPA-TDNN ([arXiv](https://arxiv.org/pdf/2005.07143)) as the embedding extractor for robust spoken language identification.

+ The main innovations of this model are:
+ 1. Incorporating geolocation prediction as an auxiliary task during training.
+ 2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.
+
+ This geolocation-aware strategy greatly improves robustness, especially for dialects and accented speech (see the toy sketch below).
+
+ For further details on the geolocation-aware LID methodology, please refer to our paper: *Geolocation-Aware Robust Spoken Language Identification* (arXiv link to be added).
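+
+ As a rough illustration of innovation 1, the toy sketch below combines a language-ID cross-entropy loss with an auxiliary geolocation loss in a single training objective. It is illustrative only, not the actual ESPnet implementation: the class name `ToyGeoLIDHeads`, the loss weight `geo_weight`, the 1280-dim feature size, and the use of a simple regression head for geolocation are all assumptions made for the example.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class ToyGeoLIDHeads(nn.Module):
+     """Toy LID + geolocation heads on top of pooled SSL encoder features."""
+
+     def __init__(self, feat_dim: int, num_langs: int, geo_weight: float = 0.5):
+         super().__init__()
+         self.lid_head = nn.Linear(feat_dim, num_langs)  # language logits
+         self.geo_head = nn.Linear(feat_dim, 2)          # e.g. (latitude, longitude)
+         self.geo_weight = geo_weight                    # auxiliary loss weight (hypothetical)
+
+     def forward(self, ssl_features, lang_ids, geo_targets):
+         pooled = ssl_features.mean(dim=1)               # simple temporal pooling
+         lid_loss = nn.functional.cross_entropy(self.lid_head(pooled), lang_ids)
+         geo_loss = nn.functional.mse_loss(self.geo_head(pooled), geo_targets)
+         return lid_loss + self.geo_weight * geo_loss    # joint objective
+
+ # Random tensors stand in for a batch of 4 utterances with 100 frames of
+ # 1280-dim SSL features, their language labels, and geolocation targets.
+ heads = ToyGeoLIDHeads(feat_dim=1280, num_langs=157)
+ loss = heads(torch.randn(4, 100, 1280), torch.randint(0, 157, (4,)), torch.randn(4, 2))
+ loss.backward()
+ ```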
+
+ ### Usage Guide: How to use in ESPnet2
+
+ #### Prerequisites
+ First, ensure you have ESPnet installed. If not, follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html).
+
+ #### Quick Start
+ Run the following commands to set up and use the pre-trained model:

  ```bash
  cd espnet

  pip install -e .
  cd egs2/geolid/lid1

  # Download the exp_combined to egs2/geolid/lid1

  ./run_combined.sh --skip_data_prep false --skip_train true
  ```

+ This will download the pre-trained model and run inference using the VoxLingua107 test data.
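+
+ If you prefer to fetch the model files from Python rather than the `hf` CLI, the `huggingface_hub` library can download the same repository; the target directory below is only an example and should point at your `egs2/geolid/lid1` checkout.
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the repository contents (including exp_combined) into the
+ # current directory; adjust local_dir to your egs2/geolid/lid1 path.
+ snapshot_download(
+     repo_id="espnet/geolid_combined_shared_trainable",
+     local_dir=".",
+ )
+ ```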
+
+ ### Train and Evaluation Datasets
+
+ The training data combines five domain-specific corpora, totaling 9,865 hours of speech covering 157 languages.
+
+ | Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (Combined) |
+ | ------------- | ----------- | ------------------ | ------- | ------------------------- |
+ | [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) | YouTube | 107/33 | No | Seen |
+ | [Babel](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=31a13cefb42647e924e0d2778d341decc44c40e9) | Telephone | 25/25 | No | Seen |
+ | [FLEURS](https://huggingface.co/datasets/google/xtreme_s) | Read speech | 102/102 | No | Seen |
+ | [ML-SUPERB 2.0](https://huggingface.co/datasets/espnet/ml_superb_hf) | Mixed | 137/(137, 8) | Yes | Seen |
+ | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Parliament | 16/16 | No | Seen |
+
+
+ ### Results
+
+ **Accuracy (%) on In-domain and Out-of-domain Test Sets**
+
+ <style>
+ .hf-model-cell {
+   max-width: 120px;
+   overflow-x: auto;
+   white-space: nowrap;
+   scrollbar-width: thin;
+   scrollbar-color: #888 #f1f1f1;
+ }
+
+ .config-cell {
+   max-width: 100px;
+   overflow-x: auto;
+   white-space: nowrap;
+   scrollbar-width: thin;
+   scrollbar-color: #888 #f1f1f1;
+ }
+
+ .hf-model-cell::-webkit-scrollbar,
+ .config-cell::-webkit-scrollbar {
+   height: 6px;
+ }
+
+ .hf-model-cell::-webkit-scrollbar-track,
+ .config-cell::-webkit-scrollbar-track {
+   background: #f1f1f1;
+   border-radius: 3px;
+ }
+
+ .hf-model-cell::-webkit-scrollbar-thumb,
+ .config-cell::-webkit-scrollbar-thumb {
+   background: #888;
+   border-radius: 3px;
+ }
+
+ .hf-model-cell::-webkit-scrollbar-thumb:hover,
+ .config-cell::-webkit-scrollbar-thumb:hover {
+   background: #555;
+ }
+ </style>
+
+ <div style="overflow-x: auto;">
+
+ | ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
+ | ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
+ | <div class="hf-model-cell">[egs2/geolid/lid1](https://github.com/espnet/espnet/tree/master/egs2/geolid/lid1)</div> | <div class="config-cell">`conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable.yaml`</div> | 94.4 | 95.4 | 97.7 | 88.6 | 86.8 | 99.0 | 93.7 |
+
+ </div>
+
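+ As a quick sanity check on the table above, the Macro Avg. column is the unweighted mean of the six test-set accuracies:
+
+ ```python
+ # Unweighted mean of the six test-set accuracies reported above.
+ scores = [94.4, 95.4, 97.7, 88.6, 86.8, 99.0]
+ print(sum(scores) / len(scores))  # ~93.65, reported as 93.7 in the table
+ ```
+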
  ## LID config

+ ### Citation

  ```BibTex
+ @inproceedings{wang2025geolid,
+ author={Qingzheng Wang and Hye-jin Shim and Jiancheng Sun and Shinji Watanabe},
+ title={Geolocation-Aware Robust Spoken Language Identification},
+ year={2025},
+ booktitle={Proceedings of ASRU},
+ }
+
  @inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
  }
  ```