File size: 6,255 Bytes
03930f8 a0402ee 03930f8 0503bb2 03930f8 0503bb2 03930f8 ed79b5b 03930f8 ed79b5b 03930f8 0503bb2 7accec1 0503bb2 03930f8 0503bb2 03930f8 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 |
---
language: de
datasets:
- common_voice
inference: false
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
results:
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice de
type: common_voice
args: de
metrics:
- name: Test WER
type: wer
value: 3.6433399042523233
- name: Test CER
type: cer
value: 1.5398893560981173
---
## Overview
This folder contains a fully trained German speech recognition pipeline
consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B **TEVR** architecture
and a 5-gram KenLM language model.
For an explanation of the TEVR enhancements and their motivation, please see our paper:
[TEVR: Improving Speech Recognition by Token Entropy Variance Reduction](https://arxiv.org/abs/2206.12693).
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tevr-improving-speech-recognition-by-token/speech-recognition-on-common-voice-german)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-german?p=tevr-improving-speech-recognition-by-token)
This pipeline scores a very competitive (as of June 2022) **word error rate of 3.64%** on CommonVoice German.
The character error rate was 1.54%.
## Citation
If you use this ASR pipeline for research, please cite:
```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.12693,
doi = {10.48550/ARXIV.2206.12693},
url = {https://arxiv.org/abs/2206.12693},
author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},
keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},
title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
```
## TEVR Tokenizer Creation / Testing
See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:
- our trained ByT5 model used to calculate the entropies in the paper
- a Jupyter Notebook to generate a TEVR Tokenizer from a text corpus
- a Jupyter Notebook to generate the illustration image in the paper
## Evaluation
To evalue this pipeline yourself and/or on your own data, see the `HF Eval Script.ipynb` Jupyter Notebook
or use the following python script:
```python
!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations
```
```python
from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re
# load testing dataset
testing_dataset = load_dataset("common_voice", "de", split="test")
# replace invisible characters with space
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')
def text_fix(text):
# change ß to ss
text = text.replace('ß','ss')
# convert dash to space and remove double-space
text = text.replace('-',' ').replace(' ',' ').replace(' ',' ')
# make lowercase
text = text.lower()
# remap all invisible characters to space
text = text.translate(replacements).strip()
# for easier comparison to Zimmermeister, replace unrepresentable characters with ?
text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
# remove multiple spaces (again)
text = ' '.join([w for w in text.split(' ') if w != ''])
return text
# load model
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')
# load processor
class HajoProcessor(Wav2Vec2ProcessorWithLM):
@staticmethod
def get_missing_alphabet_tokens(decoder, tokenizer):
return []
processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
# this function will be called for each WAV file
def predict_single_audio(batch, image=False):
audio = batch['audio']['array']
# resample, if needed
if batch['audio']['sampling_rate'] != 16000:
audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
# normalize
audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
# ask HF processor to prepare audio for GPU eval
input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
# call model on GPU
with torch.no_grad():
logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
# ask HF processor to decode logits
decoded = processor.decode(logits, beam_width=500)
# return as dictionary
return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }
# process all audio files
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)
# print results
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
```
WER 3.6433399042523233 %
CER 1.5398893560981173 %
|