|
--- |
|
language: id |
|
datasets: |
|
- mozilla-foundation/common_voice_8_0 |
|
metrics: |
|
- wer |
|
--- |
|
|
|
# wav2vec 2.0 XLSR-53 Model |
|
|
|
This is the [wav2vec 2.0 XLSR-53 model](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) fine-tuned on the [Common Voice 8.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) for Bahasa Indonesia, using the `train`, `validation`, and `other` splits (~32,000 audio samples). The model was trained for research purposes as part of my undergraduate thesis. |
|
|
|
## Preprocessing |
|
1. Remove symbols from the transcripts. |
|
2. Convert digits (0, 1, ..., 9) to their word forms (nol, satu, ..., sembilan). |
|
3. Convert all characters to lowercase. |
|
4. Resample the audio data to 16 kHz. |
|
5. Use the data collator from [this example](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2). |
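
The transcript steps above (symbol removal, digit conversion, lowercasing) can be sketched as follows. This is a minimal illustration and not the exact code used for this model: the function name `normalize_transcript`, the choice to spell out multi-digit numbers digit by digit, and the decision to keep only the letters `a`-`z` are all assumptions.

```python
import re

# Indonesian words for the digits 0-9.
DIGIT_WORDS = {
    "0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
    "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan", "9": "sembilan",
}


def normalize_transcript(text: str) -> str:
    """Hypothetical helper: digits to words, lowercase, strip symbols."""
    # Spell out every ASCII digit on its own (assumption: "10" -> "satu nol").
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    text = text.lower()
    # Assumption: everything except a-z and whitespace counts as a symbol.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the repeated spaces introduced by the substitutions.
    return " ".join(text.split())
```

For example, `normalize_transcript("Saya punya 2 kucing!")` returns `"saya punya dua kucing"`. Note that this sketch also splits hyphenated words such as `anak-anak` into two tokens, which may or may not match the original preprocessing.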
|
|
|
## Hyperparameters used |
|
- Learning rate = 1e-4 |
|
- Maximum Epochs = 30 |
|
- Batch size = 4 (limited by the available compute resources) |
|
- Early stopping = stop when the validation WER has not improved for 2 consecutive evaluations |
|
- Other parameters use the defaults from [this config](https://huggingface.co/docs/transformers/v4.20.1/en/model_doc/wav2vec2#overview) |
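
The early-stopping rule listed above can be illustrated with a small, framework-independent sketch. The function name `early_stopping_index` and the treatment of a tie as "no improvement" are assumptions; the actual training code is not shown in this card. In 🤗 Transformers, `EarlyStoppingCallback(early_stopping_patience=2)` provides comparable behavior, though whether it was used here is not stated.

```python
from typing import Optional, Sequence


def early_stopping_index(wer_history: Sequence[float], patience: int = 2) -> Optional[int]:
    """Return the 0-based index of the validation at which training would stop.

    Training stops once the WER has failed to improve on the best value seen
    so far for `patience` consecutive validations. Returns None if that
    never happens within `wer_history`.
    """
    best = float("inf")
    stale = 0  # consecutive validations without improvement
    for index, wer in enumerate(wer_history):
        if wer < best:
            best = wer
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None
```

With a hypothetical history of `[0.40, 0.30, 0.31, 0.32]`, the best WER is reached at the second validation and training would stop at index 3, after two validations without improvement.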
|
|
|
## Results |
|
The reported result is the average of 5 runs, each evaluated on the `test` split of the Common Voice dataset for Bahasa Indonesia. |
|
|
|
**Test Result: 15.6% WER** |
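
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for this model; the function name `word_error_rate` is an assumption, and it scores a single utterance, whereas metric libraries typically aggregate errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("saya makan nasi goreng", "saya makan nasi")` gives `0.25`: one deleted word out of four reference words.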
|
|
|
## References |
|
- [Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) |
|
- [Wav2Vec2-Large-XLSR-Indonesian by Indonesian NLP](https://huggingface.co/indonesian-nlp/wav2vec2-large-xlsr-indonesian-baseline) |
|
|