File size: 4,387 Bytes
1a10833 b5763d6 1a10833 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 |
---
language:
- nl
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_8_0
- robust-speech-event
- model_for_talk
- nl
- nl_NL
- nl_BE
datasets:
- mozilla-foundation/common_voice_8_0
model-index:
- name: xls-r-nl-v1-cv8-lm
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 8
type: mozilla-foundation/common_voice_8_0
args: nl
metrics:
- name: Test WER
type: wer
value: 3.93
- name: Test CER
type: cer
value: 1.22
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Robust Speech Event - Dev Data
type: speech-recognition-community-v2/dev_data
args: nl
metrics:
- name: Test WER
type: wer
value: 16.35
- name: Test CER
type: cer
value: 9.64
---
# XLS-R-based CTC model with 5-gram language model from Open Subtitles
This model is a version of [facebook/wav2vec2-xls-r-2b-22-to-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16) fine-tuned mainly on the [CGN dataset](https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/), as well as the [MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - NL](https://commonvoice.mozilla.org) dataset (see details below), on which a large 5-gram language model is added based on the Open Subtitles Dutch corpus. This model achieves the following results on the evaluation set (of Common Voice 8.0):
- Wer: 0.03931
- Cer: 0.01224
> **IMPORTANT NOTE**: The `hunspell` typo fixer is **not enabled** on the website, which returns raw CTC+LM results. Hunspell reranking is only available in the `eval.py` decoding script. For best results, please use the code in that file while using the model locally for inference.
> **IMPORTANT NOTE**: Evaluating this model requires `apt install libhunspell-dev` and a pip install of `hunspell` in addition to pip installs of `pipy-kenlm` and `pyctcdecode` (see `install_requirements.sh`); in addition, the chunking lengths and strides were optimized for the model as `12s` and `2s` respectively (see `eval.sh`).
> **QUICK REMARK**: The "Robust Speech Event" set does not contain cleaned text, so its WER/CER are vastly over-estimated. For instance `2014` in the dev set is left as numbers but will be recognized as `tweeduizen veertien` which counts as 3 mistakes (`2014` missing, and both `tweeduizend` and `veertien` wrongly inserted). Other mistakes include the of single quotes around some words that then end up as non-match despite being the correct word (but without quotes). Real error rate on the dev set is significantly lower than reported.
## Model description
The model takes 16kHz sound input, and uses a Wav2Vec2ForCTC decoder with 48 letters to output the letter-transcription probabilities per frame.
To improve accuracy, a beam-search decoder based on `pyctcdecode` is then used; it reranks the most promising alignments based on a 5-gram language model trained on the Open Subtitles Dutch corpus.
To further deal with typos, `hunspell` is used to propose alternative spellings for words not in the unigrams of the language model. These alternatives are then reranked based on the language model trained above, and a penalty proportional to the levenshtein edit distance between the alternative and the recognized word. This for examples enables to correct `collegas` into `collega's` or `gogol` into `google`.
## Intended uses & limitations
This model can be used to transcribe Dutch or Flemish spoken dutch to text (without punctuation).
## Training and evaluation data
The model was:
0. initialized with [the 2B parameter model from Facebook](facebook/wav2vec2-xls-r-2b-22-to-16).
1. trained `5` epochs (6000 iterations of batch size 32) on [the `cv8/nl` dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
2. trained `1` epoch (36000 iterations of batch size 32) on [the `cgn` dataset](https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/).
3. trained `5` epochs (6000 iterations of batch size 32) on [the `cv8/nl` dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0).
### Framework versions
- Transformers 4.16.0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0 |