---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-large-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 12.03
    - name: Test CER
      type: cer
      value: 3.18
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 11.35
    - name: Test CER
      type: cer
      value: 2.75
---

# Whisper-large-et

This is [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around 1200 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.

## How to use

Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper). For example:

Convert the HF model to CTranslate2 (CT2) format:

```
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
```

Decode:

```
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```

## Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 991        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *1161*     |

## Training procedure

The model was finetuned using ESPnet and then converted to the `transformers` format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script. The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model. Finetuning was done for 3 epochs, with model averaging at the end of training.

*Update*: the 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and is therefore especially well suited for use with e.g. [faster-whisper](https://github.com/guillaumekln/faster-whisper) to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).

## Evaluation results

### WER

The WER results below were obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER (%) |
|-------------------|:-------:|
| Common Voice 8.0  | 11.3    |
| Common Voice 11.0 | 12.0    |
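
The exact evaluation script is not part of this card. As a rough reproduction sketch under stated assumptions (greedy decoding via `num_beams=1`, no text normalization of references or hypotheses, which can shift WER noticeably), the Common Voice 11.0 number could be recomputed with Hugging Face `datasets`, `transformers`, and `evaluate` along these lines:

```python
# A reproduction sketch, not the exact evaluation script used for this card.
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    device=0 if torch.cuda.is_available() else -1,
)

# Common Voice 11.0 Estonian test split (a gated dataset: accept the terms on
# the Hub and log in first), resampled to the 16 kHz that Whisper expects.
ds = load_dataset("mozilla-foundation/common_voice_11_0", "et", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

wer = evaluate.load("wer")
predictions, references = [], []
for sample in ds:
    # Greedy decoding (num_beams=1), matching the setup of the WER table above.
    out = asr(
        sample["audio"]["array"],
        generate_kwargs={"language": "et", "task": "transcribe", "num_beams": 1},
    )
    predictions.append(out["text"])
    references.append(sample["sentence"])

print("WER:", 100 * wer.compute(predictions=predictions, references=references))
```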
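
Finally, coming back to the *How to use* section above: the converted CT2 model can also be loaded from Python with the `faster-whisper` library. A minimal sketch, assuming the conversion step wrote the model to `whisper-large-et.ct2` and using a placeholder input file name:

```python
# Minimal sketch, assuming the CTranslate2 conversion from the "How to use"
# section is in ./whisper-large-et.ct2; "some_file.mp3" is a placeholder.
from faster_whisper import WhisperModel

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# faster-whisper chunks long audio internally, so long recordings can be
# transcribed end-to-end, which is what the long-segment training targets.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```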