---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
  - name: whisper-small-singlish-122k
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: NSC
          type: NSC
        metrics:
          - name: WER
            type: WER
            value: 9.69
---

# Whisper-small-singlish-122k

This model is [openai/whisper-small](https://huggingface.co/openai/whisper-small) fine-tuned on a 122k-sample subset of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).

It achieves the following results on the evaluation set (43,788 samples):

- Loss: 0.171377
- WER: 9.69

## Model Details

### Model Description

- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)

## Uses

The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).

For testing, the recommended audio should:

1. Involve local Singapore slang, dialect, names, and terms.
2. Be spoken with a Singaporean accent.

### Direct Use

To use the model in an application, load it with the `transformers` library:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
```
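The snippet above only builds the pipeline. As a minimal sketch of a full transcription call, assuming a local audio file (the path `sample.wav` is hypothetical) and that `ffmpeg` is available so the pipeline can decode files from disk, the helper names `load_asr` and `transcribe` below are illustrative and not part of the model or of `transformers`:

```python
from typing import Any, Callable, Dict

MODEL_ID = "jensenlwt/whisper-small-singlish-122k"


def load_asr():
    """Build the ASR pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so that `transcribe` can be used and tested without transformers.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=MODEL_ID)


def transcribe(asr: Callable[[str], Dict[str, Any]], audio_path: str) -> str:
    """Run the pipeline on one audio file and return the stripped transcript."""
    result = asr(audio_path)  # the ASR pipeline returns a dict with a "text" key
    return result["text"].strip()


# Example (needs network access and a real audio file):
# asr = load_asr()
# print(transcribe(asr, "sample.wav"))  # "sample.wav" is a hypothetical path
```

Short, clearly spoken clips are the safest input, since long-form audio is listed as out of scope below.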

### Out-of-Scope Use

- Long-form audio
- Broken Singlish (typically from older generations)
- Poor-quality audio (the training samples were recorded in a controlled environment)
- Conversational speech (the model was not trained on conversations)

## Training Details

### Training Data

The model was trained on the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus), specifically **Part 2**: prompted read-speech recordings that involve local named entities, slang, and dialect.

For training, I used the first 300 transcripts in the corpus, which amount to around 122k samples from ~161 speakers.

### Training Procedure

The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation.
Based on previous training experience, I also ended training early if `eval_loss` did not decrease for two consecutive evaluation steps.

#### Training Hyperparameters

The following hyperparameters were used:

- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_step**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225
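
As a rough sanity check, the batch size can be related to the Epoch column of the results table below. This is a sketch under stated assumptions: training ran on a single device, the last partial batch of each epoch was kept, and the listed batch size was in effect at step 500 (the card notes it was adjusted at times). The function names are illustrative.

```python
from typing import Tuple

BATCH_SIZE = 128                  # from the hyperparameter list
GRADIENT_ACCUMULATION_STEPS = 1   # from the hyperparameter list

# First row of the results table: step 500 corresponds to epoch 0.654450.
STEP, EPOCH = 500, 0.654450


def steps_per_epoch(step: int, epoch: float) -> int:
    """Optimizer steps in one epoch, inferred from a (step, epoch) pair."""
    return round(step / epoch)


def samples_per_epoch_bounds(n_steps: int, effective_batch: int) -> Tuple[int, int]:
    """Smallest and largest dataset size that needs exactly n_steps batches."""
    return (n_steps - 1) * effective_batch + 1, n_steps * effective_batch


effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
n_steps = steps_per_epoch(STEP, EPOCH)
low, high = samples_per_epoch_bounds(n_steps, effective_batch)
print(effective_batch, n_steps, low, high)  # prints: 128 764 97665 97792
```

Under these assumptions one epoch covers roughly 97.7k samples. That is below the 122k figure quoted for the training subset, and the card does not explain the difference.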

#### Training Results

| Steps | Epoch    | Train Loss | Eval Loss | WER (%)            |
|:-----:|:--------:|:----------:|:---------:|:------------------:|
| 500   | 0.654450 | 0.7418     | 0.3889    | 17.968250          |
| 1000  | 1.308901 | 0.2831     | 0.2519    | 11.880948          |
| 1500  | 1.963351 | 0.1960     | 0.2038    | 9.948440           |
| 2000  | 2.617801 | 0.1236     | 0.1872    | 9.420248           |
| 2500  | 3.272251 | 0.0970     | 0.1791    | 8.539280           |
| 3000  | 3.926702 | 0.0728     | 0.1714    | 8.207827           |
| 3500  | 4.581152 | 0.0484     | 0.1741    | 8.145801           |
| 4000  | 5.235602 | 0.0401     | 0.1773    | 8.138047           |

The checkpoint with the lowest evaluation loss (step 3000) is used as the final model.
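
The early-stopping rule from the training procedure can be replayed against the evaluation losses in the table above. The following is a minimal sketch of a patience-based rule, not the training code that was actually used; the card does not say how the rule was implemented (in `transformers`, `EarlyStoppingCallback(early_stopping_patience=2)` is one way to get similar behaviour).

```python
from typing import List, Optional, Tuple

# (step, eval_loss) pairs copied from the results table.
EVAL_HISTORY: List[Tuple[int, float]] = [
    (500, 0.3889), (1000, 0.2519), (1500, 0.2038), (2000, 0.1872),
    (2500, 0.1791), (3000, 0.1714), (3500, 0.1741), (4000, 0.1773),
]


def early_stop(
    history: List[Tuple[int, float]], patience: int = 2
) -> Tuple[Optional[int], Optional[int]]:
    """Return (best_step, stop_step); stop_step is None if the rule never triggers."""
    best_step, best_loss = None, float("inf")
    bad_evals = 0  # consecutive evaluations without a new best loss
    for step, loss in history:
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None


best_step, stop_step = early_stop(EVAL_HISTORY)
print(best_step, stop_step)  # prints: 3000 4000
```

With a patience of two evaluations, the rule stops at step 4000 and keeps the step-3000 checkpoint, which matches the table ending well short of `max_steps` (5000).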

### Testing Data, Factors & Metrics

#### Testing Data

To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which amounts to around 43k samples.

### Results

| Model                        | WER   |
|:----------------------------:|:-----:|
| fine-tuned-122k-whisper-small| 9.69% |
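
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a plain-Python illustration; the card does not state which implementation or text normalisation produced the 9.69% figure, and the example sentences are made up.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


# Hypothetical transcripts, for illustration only.
refs = ["i want to eat chicken rice", "the mrt is very crowded"]
hyps = ["i want to eat chicken rice", "the mrt very crowd"]
print(round(wer(refs, hyps), 2))  # prints: 18.18
```

Here the second hypothesis drops one word and gets another wrong, giving 2 errors over 11 reference words.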

#### Summary

The model is not perfect, but when the audio is spoken clearly it transcribes Singaporean terms and slang accurately.

### Compute Infrastructure

Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).

#### Hardware

- Single A6000 GPU

## Model Card Contact

[Low Wei Teck](mailto:jensenlwt@gmail.com)