---
language: multilingual
thumbnail:
tags:
- audio-classification
- speechbrain
- embeddings
- Language
- Identification
- pytorch
- ECAPA-TDNN
- TDNN
- VoxLingua107
license: "apache-2.0"
datasets:
- VoxLingua107
metrics:
- Accuracy
widget:
- example_title: English Sample
  src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
---
# VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
## Model description
This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition.
The model can classify a speech utterance according to the language spoken.
It covers 107 different languages (Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Bashkir, Belarusian,
Bulgarian, Bengali, Tibetan, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English,
Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Galician, Guarani, Gujarati, Manx, Hausa,
Hawaiian, Hindi, Croatian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Icelandic, Italian, Hebrew,
Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Latin, Luxembourgish, Lingala, Lao,
Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali,
Dutch, Norwegian Nynorsk, Norwegian, Occitan, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sanskrit,
Scots, Sindhi, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil,
Telugu, Tajik, Thai, Turkmen, Tagalog, Turkish, Tatar, Ukrainian, Urdu, Uzbek, Vietnamese, Waray, Yiddish, Yoruba,
Mandarin Chinese).
## Intended uses & limitations
The model has two uses:
- use 'as is' for spoken language recognition
- use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data

The model is trained on automatically collected YouTube data. For more
information about the dataset, see the [VoxLingua107 page](http://bark.phon.ioc.ee/voxlingua107/).
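
The second use can be approached in many ways. As a minimal sketch (not the authors' method), the snippet below builds a nearest-centroid classifier with cosine similarity over utterance embeddings. The helper names `fit_centroids` and `predict` are hypothetical, and toy 3-dimensional vectors stand in for the 256-dimensional embeddings the model produces.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to the given vector."""
    return max(centroids, key=lambda lab: cosine(centroids[lab], vec))


# Toy vectors and labels (assumptions for illustration only).
train_vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.3]]
train_labels = ["et", "et", "fi", "fi"]

centroids = fit_centroids(train_vecs, train_labels)
print(predict(centroids, [0.8, 0.3, 0.0]))  # et
```

With real data, each vector would come from the model's embedding extractor, flattened to a plain list (for example with `emb.squeeze().tolist()`). A trained classifier such as logistic regression would likely do better than centroids when enough labeled data is available.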
#### How to use
```python
import torchaudio
# Newer SpeechBrain releases may expose this class under speechbrain.inference instead.
from speechbrain.pretrained import EncoderClassifier

language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn", savedir="tmp")

# Download a Thai language sample from Omniglot and convert it to a suitable form
signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_th.mp3")
prediction = language_id.classify_batch(signal)
print(prediction)
# (tensor([[0.3210, 0.3751, 0.3680, 0.3939, 0.4026, 0.3644, 0.3689, 0.3597, 0.3508,
#           0.3666, 0.3895, 0.3978, 0.3848, 0.3957, 0.3949, 0.3586, 0.4360, 0.3997,
#           0.4106, 0.3886, 0.4177, 0.3870, 0.3764, 0.3763, 0.3672, 0.4000, 0.4256,
#           0.4091, 0.3563, 0.3695, 0.3320, 0.3838, 0.3850, 0.3867, 0.3878, 0.3944,
#           0.3924, 0.4063, 0.3803, 0.3830, 0.2996, 0.4187, 0.3976, 0.3651, 0.3950,
#           0.3744, 0.4295, 0.3807, 0.3613, 0.4710, 0.3530, 0.4156, 0.3651, 0.3777,
#           0.3813, 0.6063, 0.3708, 0.3886, 0.3766, 0.4023, 0.3785, 0.3612, 0.4193,
#           0.3720, 0.4406, 0.3243, 0.3866, 0.3866, 0.4104, 0.4294, 0.4175, 0.3364,
#           0.3595, 0.3443, 0.3565, 0.3776, 0.3985, 0.3778, 0.2382, 0.4115, 0.4017,
#           0.4070, 0.3266, 0.3648, 0.3888, 0.3907, 0.3755, 0.3631, 0.4460, 0.3464,
#           0.3898, 0.3661, 0.3883, 0.3772, 0.9289, 0.3687, 0.4298, 0.4211, 0.3838,
#           0.3521, 0.3515, 0.3465, 0.4772, 0.4043, 0.3844, 0.3973, 0.4343]]), tensor([0.9289]), tensor([94]), ['th'])

# The scores in the prediction[0] tensor can be interpreted as cosine scores between
# the languages and the given utterance (i.e., the larger the better)
# The identified language ISO code is given in prediction[3]
print(prediction[3])
# ['th']

# Alternatively, use the utterance embedding extractor:
emb = language_id.encode_batch(signal)
print(emb.shape)
# torch.Size([1, 1, 256])
```
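
Since the scores behave like per-language similarities, it can be useful to look at the few best candidates rather than only the top one. The sketch below ranks a score list. `top_k` is a hypothetical helper, and the four scores and the index-to-label mapping are made-up stand-ins for the 107 scores in `prediction[0]` and the model's own label mapping.

```python
def top_k(scores, index_to_label, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(index_to_label[i], scores[i]) for i in ranked[:k]]


# Made-up scores and labels (assumptions for illustration only).
scores = [0.32, 0.93, 0.61, 0.47]
index_to_label = {0: "ab", 1: "th", 2: "lo", 3: "km"}

print(top_k(scores, index_to_label, k=2))  # [('th', 0.93), ('lo', 0.61)]
```

With the real model, the score list could be obtained with `prediction[0][0].tolist()`. Where the index-to-label mapping lives depends on the SpeechBrain version, so treat that part as something to verify against your installation.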
#### Limitations and bias
Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
- Its accuracy on smaller languages is probably quite limited
- It probably works worse on female speech than on male speech (because YouTube data includes much more male speech)
- Based on subjective experiments, it doesn't work well on speech with a foreign accent
- It probably doesn't work well on children's speech or on speech from persons with speech disorders
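
One way to check whether these limitations matter for a given application is to break accuracy down by language on your own labeled data. The following is a minimal sketch under that assumption. `per_language_accuracy` is a hypothetical helper, and the reference and predicted labels are invented for illustration.

```python
from collections import defaultdict


def per_language_accuracy(references, predictions):
    """Map each reference language to the fraction of its utterances predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, pred in zip(references, predictions):
        total[ref] += 1
        if ref == pred:
            correct[ref] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Invented labels: three English utterances and two Manx utterances.
refs = ["en", "en", "en", "gv", "gv"]
preds = ["en", "en", "en", "cy", "gv"]

print(per_language_accuracy(refs, preds))  # {'en': 1.0, 'gv': 0.5}
```

The same breakdown can be computed over any other grouping you have metadata for, such as speaker gender or age group.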
## Training data
The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
VoxLingua107 is a speech dataset for training spoken language identification models.
The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
The average amount of data per language is 62 hours. However, the actual amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.
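
The stated average follows directly from the two totals above. A quick check, using only the figures given in this section:

```python
total_hours = 6628    # total training speech, from the text above
num_languages = 107   # number of languages, from the text above

average_hours = total_hours / num_languages
print(round(average_hours, 1))  # 61.9, i.e. about 62 hours per language
```

This is only the mean. It says nothing about how unevenly the hours are spread across languages.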
## Training procedure
We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
The training recipe will be published soon.
## Evaluation results
Error rate: 7% on the development dataset
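
To put the error rate in terms of utterances, it can be combined with the development set size of 1609 segments given under Training data. This is a rough conversion: the 7% figure is rounded, so the resulting count is approximate rather than a reported result.

```python
error_rate = 0.07     # reported error rate on the development set
dev_segments = 1609   # development set size, from the Training data section

approx_errors = round(error_rate * dev_segments)
print(f"accuracy: {1 - error_rate:.0%}")
print(f"approximately {approx_errors} misclassified segments")
```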
### BibTeX entry and citation info
```bibtex
@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}
```