Commit 308b176 (parent: 7afbcaa): Update README

README.md (changed):
license: cc-by-4.0
---

## ESPnet2 Spoken Language Identification (LID) model

### `espnet/geolid_combined_shared_trainable`

This geolocation-aware language identification (LID) model was developed using the [ESPnet](https://github.com/espnet/espnet/) toolkit. It integrates the pretrained MMS-1B model ([facebook/mms-1b](https://huggingface.co/facebook/mms-1b)) as the encoder and ECAPA-TDNN ([arXiv](https://arxiv.org/pdf/2005.07143)) as the embedding extractor for robust spoken language identification.

The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy substantially improves robustness, especially for dialects and accented speech.
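One rough way to read innovation (1) above is as multi-task training in which geolocation prediction contributes an auxiliary loss alongside language classification. The decomposition and the weight \\( \lambda \\) below are illustrative assumptions for intuition only, not the exact formulation from the paper:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LID}} + \lambda \, \mathcal{L}_{\text{geo}}
$$

Here \\( \mathcal{L}_{\text{LID}} \\) denotes the spoken language classification loss and \\( \mathcal{L}_{\text{geo}} \\) the auxiliary geolocation prediction loss.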
For further details on the geolocation-aware LID methodology, please refer to our paper: *Geolocation-Aware Robust Spoken Language Identification* (arXiv link to be added).

### Usage Guide: How to use in ESPnet2

#### Prerequisites

First, ensure you have ESPnet installed. If not, follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html).
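If you are starting from a clean environment, the following sketch covers the checkout that the Quick Start below assumes; the repository URL is the official ESPnet repo, and the editable install mirrors the command used in the Quick Start:

```bash
# Get the ESPnet source tree (skip if you already have a checkout)
git clone https://github.com/espnet/espnet.git
cd espnet

# Install ESPnet in editable mode into the active Python environment
pip install -e .
```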
#### Quick Start

Run the following commands to set up and use the pre-trained model:

```bash
cd espnet

pip install -e .
cd egs2/geolid/lid1

# Download the exp_combined to egs2/geolid/lid1
./run_combined.sh --skip_data_prep false --skip_train true
```

This will download the pre-trained model and run inference on the VoxLingua107 test data.
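If you need to fetch the pretrained experiment directory manually (for example, to inspect `exp_combined` before running the recipe), a sketch using the Hugging Face Hub CLI is shown below. The `--exclude` pattern is illustrative, meant to avoid overwriting local files such as this card; adjust it as needed:

```bash
# Run from egs2/geolid/lid1: download this repo's files (exp_combined, config, etc.)
# into the current directory. Requires the `hf` CLI from a recent huggingface_hub;
# `huggingface-cli download` accepts the same arguments on older versions.
# The --exclude pattern here is illustrative.
hf download espnet/geolid_combined_shared_trainable --local-dir . --exclude "README.md"
```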
### Train and Evaluation Datasets

Training used a combined dataset merging five domain-specific corpora, for a total of 9,865 hours of speech covering 157 languages.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (Combined) |
| ------------- | ----------- | ------------------ | ------- | ------------------------- |
| [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) | YouTube | 107/33 | No | Seen |
| [Babel](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=31a13cefb42647e924e0d2778d341decc44c40e9) | Telephone | 25/25 | No | Seen |
| [FLEURS](https://huggingface.co/datasets/google/xtreme_s) | Read speech | 102/102 | No | Seen |
| [ML-SUPERB 2.0](https://huggingface.co/datasets/espnet/ml_superb_hf) | Mixed | 137/(137, 8) | Yes | Seen |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Parliament | 16/16 | No | Seen |
### Results

**Accuracy (%) on In-domain and Out-of-domain Test Sets**

<style>
.hf-model-cell {
  max-width: 120px;
  overflow-x: auto;
  white-space: nowrap;
  scrollbar-width: thin;
  scrollbar-color: #888 #f1f1f1;
}

.config-cell {
  max-width: 100px;
  overflow-x: auto;
  white-space: nowrap;
  scrollbar-width: thin;
  scrollbar-color: #888 #f1f1f1;
}

.hf-model-cell::-webkit-scrollbar,
.config-cell::-webkit-scrollbar {
  height: 6px;
}

.hf-model-cell::-webkit-scrollbar-track,
.config-cell::-webkit-scrollbar-track {
  background: #f1f1f1;
  border-radius: 3px;
}

.hf-model-cell::-webkit-scrollbar-thumb,
.config-cell::-webkit-scrollbar-thumb {
  background: #888;
  border-radius: 3px;
}

.hf-model-cell::-webkit-scrollbar-thumb:hover,
.config-cell::-webkit-scrollbar-thumb:hover {
  background: #555;
}
</style>

<div style="overflow-x: auto;">

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| <div class="hf-model-cell">[egs2/geolid/lid1](https://github.com/espnet/espnet/tree/master/egs2/geolid/lid1)</div> | <div class="config-cell">`conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable.yaml`</div> | 94.4 | 95.4 | 97.7 | 88.6 | 86.8 | 99.0 | 93.7 |

</div>

## LID config
### Citation

```bibtex
@inproceedings{wang2025geolid,
  author={Qingzheng Wang and Hye-jin Shim and Jiancheng Sun and Shinji Watanabe},
  title={Geolocation-Aware Robust Spoken Language Identification},
  year={2025},
  booktitle={Proceedings of ASRU},
}

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```