---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
results:
- task:
type: automatic-speech-recognition
dataset:
name: NSC
type: NSC
metrics:
- name: WER
type: WER
value: 9.69
---
# whisper-small-singlish-122k
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained on a 122k-sample subset of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
It achieves the following results on the evaluation set (43,788 samples):
- Loss: 0.171377
- WER: 9.69
## Model Details
### Model Description
- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
## Uses
The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).
The model is best tested on audio that:
1. involves local Singaporean slang, dialect, names, terms, etc.
2. is spoken with a Singaporean accent.
### Direct Use
To use the model in an application, you can make use of `transformers`:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")

# Transcribe an audio file (path is illustrative)
# text = pipe("sample.wav")["text"]
```
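For finer control, the processor and model can also be loaded directly. The sketch below is one possible lower-level setup, assuming you already have a 16 kHz mono waveform as a 1-D float array (the `transcribe` helper and its defaults are illustrative, not part of the released code):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor


def transcribe(waveform, sampling_rate=16000,
               model_id="jensenlwt/whisper-small-singlish-122k"):
    """Transcribe a 16 kHz mono waveform (1-D float array) to text."""
    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    # Convert raw audio to log-mel input features
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features, max_length=225)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```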
### Out-of-Scope Use
- Long form audio
- Broken Singlish (typically spoken by the older generation)
- Poor quality audio (audio samples are recorded in a controlled environment)
- Conversation (as the model is not trained on conversation)
## Training Details
### Training Data
I made use of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.
Specifically, I used **Part 2** – a series of prompted read-speech recordings that involve local named entities, slang, and dialect.
For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.
### Training Procedure
The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation.
In addition, training was stopped early if the evaluation loss did not decrease over two consecutive evaluation steps, based on previous training experience.
#### Training Hyperparameters
The following hyperparameters are used:
- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_steps**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225
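These settings map onto `transformers`' `Seq2SeqTrainingArguments` roughly as follows. This is a sketch reconstructed from the list above, not the exact training script; the output directory name and the best-model settings are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-singlish-122k",  # illustrative
    per_device_train_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=500,
    max_grad_norm=1.0,
    generation_max_length=225,
    predict_with_generate=True,      # needed to compute WER during evaluation
    load_best_model_at_end=True,     # keep the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```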
#### Training Results
| Steps | Epoch | Train Loss | Eval Loss | WER |
|:-----:|:--------:|:----------:|:---------:|:------------------:|
| 500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 |
| 1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 |
| 1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 |
| 2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 |
| 2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 |
| 3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 |
| 3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
| 4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
The model with the lowest evaluation loss is used as the final checkpoint.
### Testing Data, Factors & Metrics
#### Testing Data
To test the model, I made use of the last 100 transcripts (held-out test set) in the corpus, which is around 43k samples.
### Results
| Model | WER |
|:----------------------------:|:-----:|
| fine-tuned-122k-whisper-small| 9.69% |
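For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal pure-Python illustration (the example sentences are made up, not drawn from the corpus):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i want to makan at the kopitiam",
          "i want to makan at kopitiam"))  # ≈ 0.143 (1 deletion / 7 reference words)
```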
#### Summary
The model is not perfect, but when the audio is spoken clearly it transcribes Singaporean terms and slang accurately.
### Compute Infrastructure
Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).
#### Hardware
- Single A6000 GPU
## Model Card Contact
[Low Wei Teck](mailto:jensenlwt@gmail.com)