---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-large-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 12.03
    - name: Test CER
      type: cer
      value: 3.18
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 11.35
    - name: Test CER
      type: cer
      value: 2.75
---

# Whisper-large-et

This is [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around 1200 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.

## How to use

Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper). For example:

Convert the HF model to CTranslate2 (CT2) format:

```
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
```

Decode:

```
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```

## Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 991        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *1161*     |

## Training procedure

The model was finetuned using ESPnet and then converted to the `transformers` format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script. The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model. Finetuning was done for 3 epochs, with model averaging at the end of training.

*Update*: the 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and is therefore especially well suited for use with e.g. [faster-whisper](https://github.com/guillaumekln/faster-whisper) to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).

## Evaluation results

### WER

The WER results below were obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER (%) |
|-------------------|:-------:|
| Common Voice 8.0  | 11.3    |
| Common Voice 11.0 | 12.0    |
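
The exact evaluation script is not part of this card. As a rough reproduction sketch under stated assumptions (greedy decoding via `num_beams=1`, no text normalization of references or hypotheses, which can shift WER noticeably), the Common Voice 11.0 number could be recomputed with Hugging Face `datasets`, `transformers`, and `evaluate` along these lines:

```python
# A reproduction sketch, not the exact evaluation script used for this card.
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    device=0 if torch.cuda.is_available() else -1,
)

# Common Voice 11.0 Estonian test split (a gated dataset: accept the terms on
# the Hub and log in first), resampled to the 16 kHz that Whisper expects.
ds = load_dataset("mozilla-foundation/common_voice_11_0", "et", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

wer = evaluate.load("wer")
predictions, references = [], []
for sample in ds:
    # Greedy decoding (num_beams=1), matching the setup of the WER table above.
    out = asr(
        sample["audio"]["array"],
        generate_kwargs={"language": "et", "task": "transcribe", "num_beams": 1},
    )
    predictions.append(out["text"])
    references.append(sample["sentence"])

print("WER:", 100 * wer.compute(predictions=predictions, references=references))
```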
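
Finally, coming back to the *How to use* section above: the converted CT2 model can also be loaded from Python with the `faster-whisper` library. A minimal sketch, assuming the conversion step wrote the model to `whisper-large-et.ct2` and using a placeholder input file name:

```python
# Minimal sketch, assuming the CTranslate2 conversion from the "How to use"
# section is in ./whisper-large-et.ct2; "some_file.mp3" is a placeholder.
from faster_whisper import WhisperModel

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# faster-whisper chunks long audio internally, so long recordings can be
# transcribed end-to-end, which is what the long-segment training targets.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```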