---
language:
- id
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
- magic_data
- TITML
metrics:
- wer
base_model: openai/whisper-medium
model-index:
- name: Whisper Medium Indonesian
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 id
      type: mozilla-foundation/common_voice_11_0
      config: id
      split: test
    metrics:
    - type: wer
      value: 3.8273540533062804
      name: Wer
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs id_id
      type: google/fleurs
      config: id_id
      split: test
    metrics:
    - type: wer
      value: 9.74
      name: Wer
---

# Whisper Medium Indonesian

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the
Indonesian mozilla-foundation/common_voice_11_0, magic_data, TITML, and google/fleurs datasets. It achieves the following
results:
### CV11 test split:
- Loss: 0.0698
- Wer: 3.8274
### Google/fleurs test split:
- Wer: 9.74

## Usage

```python
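from transformers import pipeline

# Load the fine-tuned checkpoint into a speech-recognition pipeline.
transcriber = pipeline(
  "automatic-speech-recognition",
  model="cahya/whisper-medium-id"
)
# Force Indonesian transcription instead of relying on language detection.
transcriber.model.config.forced_decoder_ids = (
  transcriber.tokenizer.get_decoder_prompt_ids(
    language="id",
    task="transcribe"
  )
)
# The pipeline returns a dict; the transcript is under the "text" key.
transcription = transcriber("my_audio_file.mp3")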
```

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-06
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 10000
- mixed_precision_training: Native AMP
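
To make the schedule implied by these values concrete, here is a minimal sketch of a linear warmup followed by a linear decay to zero, using the learning rate, warmup steps, and training steps listed above. The function name `linear_schedule_lr` is hypothetical, and the sketch assumes the `linear` scheduler follows the usual warmup-then-decay shape; it is not taken from the training script.

```python
def linear_schedule_lr(step, base_lr=1e-6, warmup_steps=500, total_steps=10_000):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Assumption: this mirrors the common "linear with warmup" shape; the
    defaults are the values listed in the hyperparameters above.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Print the learning rate at a few points along the run.
for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 1e-06 at step 500 and reaches zero at step 10000.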

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Wer    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.0427        | 0.33  | 1000  | 0.0664          | 4.3807 |
| 0.042         | 0.66  | 2000  | 0.0658          | 3.9426 |
| 0.0265        | 0.99  | 3000  | 0.0657          | 3.8274 |
| 0.0211        | 1.32  | 4000  | 0.0679          | 3.8366 |
| 0.0212        | 1.66  | 5000  | 0.0682          | 3.8412 |
| 0.0206        | 1.99  | 6000  | 0.0683          | 3.8689 |
| 0.0166        | 2.32  | 7000  | 0.0711          | 3.9657 |
| 0.0095        | 2.65  | 8000  | 0.0717          | 3.9980 |
| 0.0122        | 2.98  | 9000  | 0.0714          | 3.9795 |
| 0.0049        | 3.31  | 10000 | 0.0720          | 3.9887 |

## Evaluation

We evaluated the model on the test splits of two datasets, [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
and [Google Fleurs](https://huggingface.co/datasets/google/fleurs).
Because Whisper transcribes casing and punctuation, we also evaluate its performance on both raw and normalized text
(lowercased, with punctuation removed). The results are as follows:

### Common Voice 11

| Model                                                                     | WER   |
|---------------------------------------------------------------------------|-------|
| [cahya/whisper-medium-id](https://huggingface.co/cahya/whisper-medium-id) | 3.83  |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)     | 12.62 |

### Google/Fleurs

| Model                                                                                                       | WER  |
|-------------------------------------------------------------------------------------------------------------|------|
| [cahya/whisper-medium-id](https://huggingface.co/cahya/whisper-medium-id)                                   | 9.74 |
| [cahya/whisper-medium-id](https://huggingface.co/cahya/whisper-medium-id) + text normalization              | tbc  |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                                       | 10.2 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) + text normalization                  | tbc  |
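
The difference between raw and normalized scoring described in the Evaluation section can be illustrated with a small, dependency-free sketch. The `normalize` and `wer` helpers below are hypothetical and are not the evaluation code used for the numbers above; the normalizer only lowercases and strips ASCII punctuation, which is a simplification of whatever normalization a full evaluation would apply, and the example sentence is made up.

```python
import string


def normalize(text):
    """Lowercase and strip ASCII punctuation (a simplified normalizer)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Made-up example: the words match, but casing and punctuation differ.
reference = "Selamat pagi, apa kabar?"
hypothesis = "selamat pagi apa kabar"

raw = wer(reference, hypothesis)
normalized = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw:.2f}, normalized WER: {normalized:.2f}")
```

In this toy case the raw score counts casing and punctuation mismatches as word errors, while the normalized score does not, which is why the two are reported separately.
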
### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2