---
language: de
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice de
      type: common_voice
      args: de
    metrics:
    - name: Test WER
      type: wer
      value: 3.6433399042523233
    - name: Test CER
      type: cer
      value: 1.5398893560981173
---

## Overview

This folder contains a fully trained German speech recognition pipeline
consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B TEVR architecture
and a 5-gram KenLM language model.
For an explanation of the TEVR enhancements and their motivation, please see our paper:
TEVR: Improving XLS-R for German ASR through Token Entropy Variance Reduction
(Krabbenhöft et al., 2022).

This pipeline scores a very competitive (as of June 2022) **word error rate of 3.64%** on CommonVoice German.
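For intuition about the two metrics reported above: WER and CER are word- and character-level edit distances, normalized by the reference length. The evaluation script below uses the `jiwer`-backed metrics from `datasets`; the following is only an illustrative pure-Python sketch (the function names `word_error_rate` and `character_error_rate` are hypothetical, not part of this pipeline):

```python
def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over two token sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)]

def word_error_rate(reference, hypothesis):
    # WER = word-level edit distance / number of reference words
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def character_error_rate(reference, hypothesis):
    # CER = character-level edit distance / number of reference characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(word_error_rate("das ist ein test", "das ist kein test"))       # 0.25
print(character_error_rate("das ist ein test", "das ist kein test"))  # 0.0625
```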
To evaluate this pipeline yourself and/or on your own data, see the `HF Eval Script.ipynb` Jupyter notebook
or use the following Python script:

## Evaluation

```python
!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations
```

```python
from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re

# load testing dataset
testing_dataset = load_dataset("common_voice", "de", split="test")

# replace invisible characters with space
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')

def text_fix(text):
    # change ß to ss
    text = text.replace('ß','ss')
    # convert dash to space and remove double-space
    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
    # make lowercase
    text = text.lower()
    # remap all invisible characters to space
    text = text.translate(replacements).strip()
    # for easier comparison to Zimmermeister, replace unrepresentable characters with ?
    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
    # remove multiple spaces (again)
    text = ' '.join([w for w in text.split(' ') if w != ''])
    return text

# load model
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')

# load processor
class HajoProcessor(Wav2Vec2ProcessorWithLM):
    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        return []

processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")

# this function will be called for each WAV file
def predict_single_audio(batch, image=False):
    audio = batch['audio']['array']
    # resample, if needed
    if batch['audio']['sampling_rate'] != 16000:
        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
    # normalize
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    # ask HF processor to prepare audio for GPU eval
    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
    # call model on GPU
    with torch.no_grad():
        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
    # ask HF processor to decode logits
    decoded = processor.decode(logits, beam_width=500)
    # return as dictionary
    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }

# process all audio files
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)

# print results
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
```

WER 3.6433399042523233 %
CER 1.5398893560981173 %
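The `text_fix` normalization in the evaluation script depends on a `replacements` table derived from the test set itself. To experiment with the same normalization steps without downloading CommonVoice, the core transformations (ß → ss, dashes → spaces, lowercasing, punctuation/symbol removal via `unicodedata`, space collapsing) can be sketched standalone; `normalize_de` is a hypothetical name, not part of this repository:

```python
import unicodedata

def normalize_de(text):
    # simplified re-implementation of the text_fix steps above,
    # with punctuation/symbols dropped via unicodedata categories
    # instead of the dataset-derived replacements table
    text = text.replace('ß', 'ss').replace('-', ' ').lower()
    # map punctuation (P), symbols (S), and separators (Z) to space
    text = ''.join(' ' if unicodedata.category(c)[0] in 'PSZ' and c != ' ' else c
                   for c in text)
    # collapse runs of spaces
    text = ' '.join(w for w in text.split(' ') if w != '')
    return text

print(normalize_de("Straße, Haupt-Bahnhof!"))  # strasse haupt bahnhof
```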