---
language: fr
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: wav2vec2-large-xlsr-53-French by Ilyes Rebai
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice fr
      type: common_voice
      args: fr
    metrics:
    - name: Test WER (v1.0)
      type: wer
      value: 15.97
    - name: Test WER (v2.0)
      type: wer
      value: 14.71
    - name: Test WER (v3.0)
      type: wer
      value: 12.82
---
## Evaluation on Common Voice FR Test
The script used for training and evaluation can be found here: https://github.com/irebai/wav2vec2

```python
import re

import torch
import torchaudio
# Note: load_metric is deprecated in recent `datasets` releases;
# evaluate.load("wer") from the `evaluate` package is its replacement.
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "Ilyes/wav2vec2-large-xlsr-53-french"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

ds = load_dataset("common_voice", "fr", split="test", cache_dir="./data/fr")

# Punctuation and symbols stripped from the reference transcripts before scoring.
chars_to_ignore_regex = '[\,\?\.\!\;\:\"\“\%\‘\”\�\‘\’\’\’\‘\…\·\!\ǃ\?\«\‹\»\›“\”\\ʿ\ʾ\„\∞\\|\.\,\;\:\*\—\–\─\―\_\/\:\ː\;\,\=\«\»\→]'

# Resample from 48 kHz to 16 kHz. Defined before map_to_array, which uses it.
resampler = torchaudio.transforms.Resample(48_000, 16_000)


def map_to_array(batch):
    # Load one clip, resample it, and normalize its reference transcript.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch


ds = ds.map(map_to_array)


def map_to_pred(batch):
    # Run the model on a padded batch and decode greedily (argmax over CTC logits).
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch


result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
```
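
The evaluation script normalizes each reference transcript before scoring: it removes punctuation, lowercases the text, and maps the typographic apostrophe to the ASCII one. The sketch below illustrates that step on a single sentence. It is only an illustration: `PUNCT` is a reduced, hypothetical punctuation class rather than the full pattern from the script, and the `normalize` helper and the sample sentence are assumptions made for this example.

```python
import re

# Hypothetical, reduced punctuation class (NOT the full pattern used in the script).
PUNCT = re.compile(r'[,?.!;:"«»…]')


def normalize(sentence: str) -> str:
    """Remove punctuation, lowercase, and map the typographic apostrophe to ASCII."""
    return PUNCT.sub("", sentence).lower().replace("’", "'")


# Invented sample sentence. The space that preceded "?" is left in place,
# because only the punctuation characters themselves are removed.
print(repr(normalize("Bonjour, comment allez-vous ?")))
# → 'bonjour comment allez-vous '
```

Leftover whitespace like the trailing space above does not change a word-level score, since words are obtained by splitting on whitespace.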

## Results

### v0.1

WER=18.29%

SER=71.44%

### v1.0

WER=15.97%

CER=5.51%

### v2.0

WER=14.71%

CER=5.06%

### v3.0

WER=12.82%

CER=4.40%
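
For readers unfamiliar with these metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length in words (WER) or characters (CER). The following is a minimal pure-Python sketch of that definition, not the code that produced the figures above. The function names and the example sentences are assumptions, and the sketch counts spaces as characters for CER, which may differ from how the reported CER was computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one sentence pair (hypothetical helper)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one sentence pair, spaces included (assumption)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example: one wrong word out of four.
ref = "il fait beau aujourd'hui"
hyp = "il fait bon aujourd'hui"
print(f"WER = {wer(ref, hyp):.2%}")  # → WER = 25.00%
print(f"CER = {cer(ref, hyp):.2%}")  # → CER = 12.50%
```

Corpus-level scores are usually obtained by summing the edit counts and the reference lengths over all utterances before dividing, rather than by averaging per-sentence ratios.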