---
language:
- es
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
- cer
model-index:
- name: Whisper Large Spanish
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 es
      type: mozilla-foundation/common_voice_11_0
      config: es
      split: test
      args: es
    metrics:
    - name: WER
      type: wer
      value: 4.673613637544826
    - name: CER
      type: cer
      value: 1.5573247819517182
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: google/fleurs es_419
      type: google/fleurs
      config: es_419
      split: test
      args: es_419
    metrics:
    - name: WER
      type: wer
      value: 5.396216546072705
    - name: CER
      type: cer
      value: 3.450427960057061
---

# Whisper Large Spanish

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on Spanish using the train split of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).

## Usage

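```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a speech recognition pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-es-cv11",
)

# Force the decoder to transcribe in Spanish rather than auto-detecting
# the language or translating.
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="es",
        task="transcribe",
    )
)

# The pipeline returns a dict; the transcribed text is under the "text" key.
transcription = transcriber("path/to/my_audio.wav")
print(transcription["text"])
```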

## Evaluation

I evaluated the model on the test splits of two datasets: [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (the same dataset used for fine-tuning) and [Fleurs](https://huggingface.co/datasets/google/fleurs) (not seen during fine-tuning). Because Whisper can transcribe casing and punctuation, I ran the evaluation in two scenarios: one on the raw text and one on normalized text (lowercased, with punctuation removed).

For Fleurs, I also evaluated the model on only the samples that contain no numerical values. Fleurs writes numerical values differently from Common Voice, the fine-tuning dataset, so this mismatch is expected to hurt the model's performance on that kind of transcription in Fleurs.
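
To make the two scenarios concrete, here is a minimal sketch of raw versus normalized scoring. It is not the exact evaluation script behind the numbers below: the helper names (`normalize_text`, `edit_distance`, `wer`, `cer`) are hypothetical, the normalization is assumed to be plain lowercasing plus removal of Unicode punctuation, and the metrics are computed per sample with the standard library rather than with a dedicated package such as `jiwer` or `evaluate`.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase the text and drop punctuation (assumed normalization)."""
    text = text.lower()
    # Drop every character in a Unicode punctuation category (P*),
    # which also covers Spanish marks such as "¿" and "¡".
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    # Collapse any whitespace left behind by the removed characters.
    return " ".join("".join(kept).split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for a single sample."""
    ref_words = reference.split()
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent for a single sample."""
    return 100 * edit_distance(reference, hypothesis) / len(reference)


# Illustrative sample (not taken from either dataset).
reference = "¿Cómo estás, Juan?"
hypothesis = "cómo estás juan"

raw_wer = wer(reference, hypothesis)
norm_wer = wer(normalize_text(reference), normalize_text(hypothesis))
raw_cer = cer(reference, hypothesis)
norm_cer = cer(normalize_text(reference), normalize_text(hypothesis))
print(raw_wer, norm_wer)
```

In this sample every word differs on the raw text only because of casing and punctuation, so the raw scores penalize formatting rather than recognition errors. A dataset-level score would normally accumulate the edit counts over all samples before dividing, instead of averaging per-sample rates.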

### Common Voice 11

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) | 2.43 | 8.85 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization | 1.56 | 4.67 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 3.71 | 12.34 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 2.45 | 6.30 |
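
One way to read the table above is as a relative error reduction over the base model. The sketch below only does arithmetic on the normalized-text rows; `relative_reduction` is a hypothetical helper introduced here for illustration.

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative error reduction in percent, compared with the baseline."""
    return 100 * (baseline - finetuned) / baseline


# Normalized-text rows of the Common Voice 11 table:
# openai/whisper-large-v2 versus jonatasgrosman/whisper-large-es-cv11.
wer_gain = relative_reduction(baseline=6.30, finetuned=4.67)
cer_gain = relative_reduction(baseline=2.45, finetuned=1.56)

print(f"Relative WER reduction: {wer_gain:.1f}%")  # 25.9%
print(f"Relative CER reduction: {cer_gain:.1f}%")  # 36.3%
```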

### Fleurs

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) | 3.06 | 9.11 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization | 3.45 | 5.40 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + keep only non-numeric samples | 1.83 | 7.57 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization + keep only non-numeric samples | 2.36 | 4.14 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 2.30 | 8.50 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 2.76 | 4.79 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + keep only non-numeric samples | 1.93 | 7.33 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization + keep only non-numeric samples | 2.50 | 4.28 |
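
The "keep only non-numeric samples" rows rely on filtering the test set before scoring. The exact filter is not spelled out here, so the following is only a sketch under one plausible assumption: a sample counts as numeric if its reference transcription contains at least one digit. The `is_non_numeric` helper, the `transcription` field name, and the example sentences are illustrative.

```python
import re

# Assumption: any digit in the reference marks the sample as "numeric".
DIGIT_PATTERN = re.compile(r"\d")


def is_non_numeric(transcription: str) -> bool:
    """Return True when the reference text contains no digits."""
    return DIGIT_PATTERN.search(transcription) is None


# Illustrative samples (not taken from Fleurs).
samples = [
    {"transcription": "la reunión empieza a las 8 de la mañana"},
    {"transcription": "la reunión empieza temprano"},
]

non_numeric_samples = [s for s in samples if is_non_numeric(s["transcription"])]
print(len(non_numeric_samples))
```

With a `datasets.Dataset`, the same predicate could be passed to `Dataset.filter` so that only the non-numeric subset is scored.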