Estonian Espnet2 ASR model

Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

How to use


from espnet2.bin.asr_inference import Speech2Text
    
model = Speech2Text.from_pretrained(
  "TalTechNLP/espnet2_estonian", 
  lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# read a sound file with 16k sample rate
import soundfile
speech, rate = soundfile.read("speech.wav")
assert rate == 16000
text, *_ = model(speech)
print(text[0])

Limitations and bias

Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:

  • Speech containing technical and other domain-specific terms
  • Children's speech
  • Non-native speech
  • Speech recorded under very noisy conditions or with a microphone far from the speaker
  • Very spontaneous and overlapping speech

Training data

Acoustic training data:

Type Amount (h)
Broadcast speech 591
Spontaneous speech 53
Elderly speech corpus 53
Talks, lectures 49
Parliament speeches 31
Total 761

Language model training data:

  • Estonian National Corpus 2019
  • OpenSubtitles
  • Speech transcripts

Training procedure

Standard EspNet2 Conformer recipe.

Evaluation results

WER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset 2864 56575 93.1 4.5 2.4 2.0 8.9 63.4
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset 273 4677 93.9 3.6 2.4 1.2 7.3 46.5
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset 818 11093 94.7 2.7 2.5 0.9 6.2 45.0
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset 1207 13865 82.3 8.5 9.3 3.4 21.2 74.1
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset 1648 22707 86.4 7.6 6.0 2.5 16.1 75.7

BibTeX entry and citation info

Citing ESPnet

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
Downloads last month
72
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.