--- language: de datasets: - common_voice inference: false metrics: - wer - cer tags: - audio - automatic-speech-recognition - speech - hf-asr-leaderboard license: apache-2.0 model-index: - name: wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft results: - task: name: Speech Recognition type: automatic-speech-recognition dataset: name: Common Voice de type: common_voice args: de metrics: - name: Test WER type: wer value: 3.6433399042523233 - name: Test CER type: cer value: 1.5398893560981173 --- ## Overview This folder contains a fully trained German speech recognition pipeline consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B TEVR architecture and a 5-gram KenLM language model. For an explanation of the TEVR enhancements and their motivation, please see our paper: TEVR: Improving XLS-R for German ASR through Token Entropy Variance Reduction (Krabbenhöft et al., 2022). This pipeline scores a very competitive (as of June 2022) **word error rate of 3.64%** on CommonVoice German. To evalue this pipeline yourself and/or on your own data, see the `HF Eval Script.ipynb` Jupyter Notebook or use the following python script: ## Evaluation ```python !pip install --quiet --root-user-action=ignore --upgrade pip !pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub !pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode !pip install --quiet --root-user-action=ignore --upgrade transformers !pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations ``` ```python from datasets import load_dataset, Audio, load_metric from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM import torchaudio.transforms as T import torch import unicodedata import numpy as np import re # load testing dataset testing_dataset = load_dataset("common_voice", "de", split="test") # replace invisible characters with space allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)])) map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-'] replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ') def text_fix(text): # change ß to ss text = text.replace('ß','ss') # convert dash to space and remove double-space text = text.replace('-',' ').replace(' ',' ').replace(' ',' ') # make lowercase text = text.lower() # remap all invisible characters to space text = text.translate(replacements).strip() # for easier comparison to Zimmermeister, replace unrepresentable characters with ? text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text) # remove multiple spaces (again) text = ' '.join([w for w in text.split(' ') if w != '']) return text # load model model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr") model.to('cuda') # load processor class HajoProcessor(Wav2Vec2ProcessorWithLM): @staticmethod def get_missing_alphabet_tokens(decoder, tokenizer): return [] processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr") # this function will be called for each WAV file def predict_single_audio(batch, image=False): audio = batch['audio']['array'] # resample, if needed if batch['audio']['sampling_rate'] != 16000: audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy() # normalize audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7) # ask HF processor to prepare audio for GPU eval input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values # call model on GPU with torch.no_grad(): logits = model(input_values.to('cuda')).logits.cpu().numpy()[0] # ask HF processor to decode logits decoded = processor.decode(logits, beam_width=500) # return as dictionary return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text } # process all audio files all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names) # print results print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%') print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%') ``` WER 3.6433399042523233 % CER 1.5398893560981173 %