---
tags:
- espnet
- audio
- automatic-speech-recognition
language: et
license: cc-by-4.0
---

# Estonian ESPnet2 ASR model

## Model description

This is a general-purpose Estonian ASR model trained at the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, for example on broadcast conversations, interviews, and talks.

## How to use

```python
import soundfile
from espnet2.bin.asr_inference import Speech2Text

model = Speech2Text.from_pretrained("TalTechNLP/espnet2_estonian")

# Read a sound file; the model expects a 16 kHz sample rate
speech, rate = soundfile.read("speech.wav")
assert rate == 16000

# The model returns a list of n-best hypotheses; each one is a tuple
# whose first element is the recognized text
nbests = model(speech)
text, *_ = nbests[0]
print(text)
```

#### Limitations and bias

Because this model was trained mostly on broadcast speech and on texts from the web, it may have trouble correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *761*      |

Language model training data:

* Estonian National Corpus 2019
* OpenSubtitles
* Speech transcripts

## Training procedure

Standard ESPnet2 Conformer recipe.

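## Checking input audio

The usage example above asserts a 16 kHz sample rate and passes the audio array straight to the model. The following is a minimal sketch of a pre-flight check that gives a readable error instead of a bare assertion failure. It uses only the standard-library `wave` module, so it works for uncompressed PCM WAV files only. The helper names `describe_wav` and `check_wav` are hypothetical, and the mono requirement is an assumption based on the usage example feeding a one-dimensional array to the model.

```python
import wave

EXPECTED_RATE = 16000  # sample rate asserted in the usage example


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def check_wav(path, expected_rate=EXPECTED_RATE):
    """Raise ValueError if the file would need conversion; return its duration."""
    rate, channels, duration = describe_wav(path)
    if rate != expected_rate:
        raise ValueError(
            f"{path}: sample rate is {rate} Hz, expected {expected_rate} Hz"
        )
    if channels != 1:
        # Assumption: the model is fed a single-channel (1-D) signal.
        raise ValueError(f"{path}: found {channels} channels, expected mono")
    return duration
```

Calling `check_wav("speech.wav")` before `soundfile.read` would flag files that need resampling or downmixing before they reach the model.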
## Evaluation results

### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset|2864|56575|93.1|4.5|2.4|2.0|8.9|63.4|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset|273|4677|93.9|3.6|2.4|1.2|7.3|46.5|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset|818|11093|94.7|2.7|2.5|0.9|6.2|45.0|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset|1207|13865|82.3|8.5|9.3|3.4|21.2|74.1|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset|1648|22707|86.4|7.6|6.0|2.5|16.1|75.7|

### BibTeX entry and citation info

#### Citing ESPnet

```BibTex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
```
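
### Reading the WER table

The columns of the WER table above appear to follow the sclite-style layout that ESPnet recipes usually report: `Snt` and `Wrd` are sentence and word counts, `Corr`, `Sub`, `Del`, and `Ins` are percentages of reference words, `Err` is the word error rate, and `S.Err` is the share of sentences with at least one error. Under that assumption, `Err = Sub + Del + Ins` and `Corr = 100 - Sub - Del`, up to rounding. The sketch below checks both relations against the rows copied from the table. The helper names and the rounding slack are illustrative choices, not part of any ESPnet API.

```python
# Rows copied from the WER table: (test set, Wrd, Corr, Sub, Del, Ins, Err)
ROWS = [
    ("aktuaalne2021.testset", 56575, 93.1, 4.5, 2.4, 2.0, 8.9),
    ("jutusaated.devset", 4677, 93.9, 3.6, 2.4, 1.2, 7.3),
    ("jutusaated.testset", 11093, 94.7, 2.7, 2.5, 0.9, 6.2),
    ("www-trans.devset", 13865, 82.3, 8.5, 9.3, 3.4, 21.2),
    ("www-trans.testset", 22707, 86.4, 7.6, 6.0, 2.5, 16.1),
]

# Each column is rounded to one decimal independently, so sums of rounded
# values can drift from the rounded total by a small amount.
ROUNDING_SLACK = 0.2


def error_rate(sub, dele, ins):
    """Word error rate in percent: substitutions + deletions + insertions."""
    return sub + dele + ins


def approx_error_words(wrd, err):
    """Approximate count of word errors implied by an error percentage."""
    return round(wrd * err / 100)


for name, wrd, corr, sub, dele, ins, err in ROWS:
    # Err should equal Sub + Del + Ins, within rounding
    assert abs(error_rate(sub, dele, ins) - err) <= ROUNDING_SLACK
    # Corr should equal 100 - Sub - Del, within rounding
    assert abs((100 - sub - dele) - corr) <= ROUNDING_SLACK
    print(f"{name}: Err {err}% of {wrd} words, "
          f"about {approx_error_words(wrd, err)} word errors")
```

Because insertions count as errors but are not reference words, `Corr` and `Err` do not need to sum to 100.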