File size: 3,002 Bytes
8301504
 
 
 
 
 
 
 
 
7f3cc26
8301504
7f3cc26
8301504
 
7f3cc26
8301504
7f3cc26
 
 
 
 
 
 
 
 
34c2de6
 
7f3cc26
 
214cbd9
 
7f3cc26
214cbd9
7f3cc26
214cbd9
7f3cc26
 
 
 
a356678
 
 
 
 
 
 
7f3cc26
214cbd9
 
5cd2240
214cbd9
 
 
 
 
 
 
 
 
 
 
 
7f3cc26
 
 
214cbd9
7f3cc26
 
 
214cbd9
 
 
 
 
 
 
 
 
7f3cc26
 
 
 
 
ef88c7a
7f3cc26
8301504
 
 
 
 
 
 
 
 
 
64f9e5f
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
---
tags:
- espnet
- audio
- automatic-speech-recognition
language: et
license: cc-by-4.0
---

# Estonian Espnet2 ASR model

## Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.


## How to use
```python

from espnet2.bin.asr_inference import Speech2Text
    
model = Speech2Text.from_pretrained(
  "TalTechNLP/espnet2_estonian", 
  lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# read a sound file with 16k sample rate
import soundfile
speech, rate = soundfile.read("speech.wav")
assert rate == 16000
text, *_ = model(speech)
print(text[0])
```

#### Limitations and bias

Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
  * Speech containing technical and other domain-specific terms
  * Children's speech
  * Non-native speech
  * Speech recorded under very noisy conditions or with a microphone far from the speaker
  * Very spontaneous and overlapping speech

## Training data
Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:------:|
| Broadcast speech      |   591  |
| Spontaneous speech    |   53   |
| Elderly speech corpus |   53   |
| Talks, lectures       |   49   |
| Parliament speeches   |   31   |
| *Total*                 |   *761*  |

Language model training data:
 * Estonian National Corpus 2019
 * OpenSubtitles
 * Speech transcripts

## Training procedure

Standard EspNet2 Conformer recipe.

## Evaluation results

### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset|2864|56575|93.1|4.5|2.4|2.0|8.9|63.4|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset|273|4677|93.9|3.6|2.4|1.2|7.3|46.5|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset|818|11093|94.7|2.7|2.5|0.9|6.2|45.0|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset|1207|13865|82.3|8.5|9.3|3.4|21.2|74.1|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset|1648|22707|86.4|7.6|6.0|2.5|16.1|75.7|


### BibTeX entry and citation info



#### Citing ESPnet
```BibTex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```