---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
- cer
base_model: openai/whisper-large-v2
model-index:
- name: Whisper Large Portuguese
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - type: wer
      value: 4.816664144852979
      name: WER
    - type: cer
      value: 1.6052355927195898
      name: CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs pt_br
      type: google/fleurs
      config: pt_br
      split: test
      args: pt_br
    metrics:
    - type: wer
      value: 8.56762285333714
      name: WER
    - type: cer
      value: 5.462965196208485
      name: CER
---

# Whisper Large Portuguese

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on Portuguese, using the train and validation splits of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Not all of the validation split was used during training: I set aside 1k samples from it for evaluation during fine-tuning.
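
The exact way those 1k evaluation samples were drawn isn't specified; the snippet below is only a minimal sketch of one way to reproduce such a split with the `datasets` library (the `seed` value is an arbitrary assumption):

```python
from datasets import load_dataset, concatenate_datasets

# Portuguese train and validation splits of Common Voice 11
# (access to the dataset must be accepted on the Hugging Face Hub)
train = load_dataset("mozilla-foundation/common_voice_11_0", "pt", split="train")
validation = load_dataset("mozilla-foundation/common_voice_11_0", "pt", split="validation")

# Hold out 1k validation samples for evaluation during fine-tuning;
# the rest of the validation split joins the training data
split = validation.train_test_split(test_size=1000, seed=42)
train_data = concatenate_datasets([train, split["train"]])
eval_data = split["test"]
```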


## Usage

```python
from transformers import pipeline

# Load the fine-tuned model as an automatic speech recognition pipeline
transcriber = pipeline(
  "automatic-speech-recognition",
  model="jonatasgrosman/whisper-large-pt-cv11"
)

# Force Portuguese transcription, so the model doesn't try to
# auto-detect the language or translate the audio
transcriber.model.config.forced_decoder_ids = (
  transcriber.tokenizer.get_decoder_prompt_ids(
    language="pt",
    task="transcribe"
  )
)

transcription = transcriber("path/to/my_audio.wav")
```
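
Setting `forced_decoder_ids` pins the decoder prompt to Portuguese transcription for every call. Depending on your `transformers` version, you may also be able to pass these options per call instead; the snippet below is a sketch that assumes a recent release of the library:

```python
# Sketch (assumes a recent transformers release): pass language/task
# to the pipeline call instead of editing the model config
transcription = transcriber(
  "path/to/my_audio.wav",
  generate_kwargs={"language": "pt", "task": "transcribe"}
)
```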

## Evaluation

I evaluated the model on the test split of two datasets: [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (the same dataset used for fine-tuning) and [Fleurs](https://huggingface.co/datasets/google/fleurs) (a dataset not seen during fine-tuning). Since Whisper can transcribe casing and punctuation, I evaluated the model in two scenarios: one using the raw text and another using normalized text (lowercase + punctuation removal). Additionally, for the Fleurs dataset, I also evaluated a scenario that keeps only samples without numerical values, since numbers are written out differently in Fleurs than in the fine-tuning dataset (Common Voice), and this mismatch is expected to hurt the model's performance on that kind of transcription in Fleurs.
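
To make those scenarios concrete, here is a small, illustrative sketch of this kind of normalization and scoring using the `evaluate` library; the normalization rules, the numeric-sample filter, and the example sentences are assumptions for illustration, not the exact code behind the reported numbers:

```python
import re
import string

import evaluate  # pip install evaluate jiwer

# Illustrative normalization: lowercase + punctuation removal
def normalize(text):
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

# Illustrative filter for the "keep only non-numeric samples" scenario
def is_non_numeric(text):
    return not re.search(r"\d", text)

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Toy reference/prediction pairs
references = ["Olá, tudo bem?", "O trem chega às 14 horas."]
predictions = ["olá tudo bem", "o trem chega às catorze horas"]

# Raw-text scores
print(wer_metric.compute(predictions=predictions, references=references))

# Normalized-text scores
norm_refs = [normalize(r) for r in references]
norm_preds = [normalize(p) for p in predictions]
print(wer_metric.compute(predictions=norm_preds, references=norm_refs))
print(cer_metric.compute(predictions=norm_preds, references=norm_refs))

# "Keep only non-numeric samples" scenario: drop pairs whose reference has digits
pairs = [(r, p) for r, p in zip(references, predictions) if is_non_numeric(r)]
```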

### Common Voice 11

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) | 2.52 | 9.56 |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) + text normalization | 1.60 | 4.82 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 4.32 | 13.92 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 2.84 | 7.02 |

### Fleurs

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) | 4.88 | 12.08 |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) + text normalization | 5.46 | 8.57 |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) + keep only non-numeric samples | 2.35 | 9.00 |
| [jonatasgrosman/whisper-large-pt-cv11](https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11) + text normalization + keep only non-numeric samples | 3.36 | 6.05 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 3.52 | 10.55 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 4.19 | 7.04 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + keep only non-numeric samples | 2.61 | 9.29 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization + keep only non-numeric samples | 3.56 | 6.15 |