jonatasgrosman committed on
Commit 765fb8c
1 Parent(s): 7f6f8d5

add evaluation
README.md CHANGED
@@ -2,32 +2,62 @@
 language: en
 datasets:
 - common_voice
+- mozilla-foundation/common_voice_6_0
 metrics:
 - wer
 - cer
 tags:
+- en
 - audio
 - automatic-speech-recognition
 - speech
 - xlsr-fine-tuning-week
+- robust-speech-event
+- mozilla-foundation/common_voice_6_0
 license: apache-2.0
 model-index:
 - name: XLSR Wav2Vec2 English by Jonatas Grosman
   results:
   - task:
-      name: Speech Recognition
+      name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
       name: Common Voice en
       type: common_voice
       args: en
     metrics:
     - name: Test WER
       type: wer
-      value: 18.98
+      value: 19.06
     - name: Test CER
       type: cer
-      value: 8.29
+      value: 7.69
+    - name: Test WER (+LM)
+      type: wer
+      value: 14.81
+    - name: Test CER (+LM)
+      type: cer
+      value: 6.84
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Robust Speech Event - Dev Data
+      type: speech-recognition-community-v2/dev_data
+      args: en
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 27.72
+    - name: Test CER
+      type: cer
+      value: 11.65
+    - name: Test WER (+LM)
+      type: wer
+      value: 20.85
+    - name: Test CER (+LM)
+      type: cer
+      value: 11.01
 ---
 
 # Wav2Vec2-Large-XLSR-53-English
@@ -109,83 +139,14 @@ for i, predicted_sentence in enumerate(predicted_sentences):
 
 ## Evaluation
 
-The model can be evaluated as follows on the English test data of Common Voice.
-
-```python
-import torch
-import re
-import warnings
-import librosa
-from datasets import load_dataset, load_metric
-from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-
-LANG_ID = "en"
-MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
-DEVICE = "cuda"
-
-CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
-                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
-                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
-                   "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
-                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
-
-test_dataset = load_dataset("common_voice", LANG_ID, split="test")
-
-wer = load_metric("wer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
-cer = load_metric("cer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
-
-chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
-
-processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
-model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
-model.to(DEVICE)
-
-# Preprocessing the datasets.
-# We need to read the audio files as arrays
-def speech_file_to_array_fn(batch):
-    with warnings.catch_warnings():
-        warnings.simplefilter("ignore")
-        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-    batch["speech"] = speech_array
-    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
-    return batch
-
-test_dataset = test_dataset.map(speech_file_to_array_fn)
-
-# Running the model on the preprocessed dataset
-def evaluate(batch):
-    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
-
-    with torch.no_grad():
-        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
-
-    pred_ids = torch.argmax(logits, dim=-1)
-    batch["pred_strings"] = processor.batch_decode(pred_ids)
-    return batch
-
-result = test_dataset.map(evaluate, batched=True, batch_size=8)
-
-predictions = [x.upper() for x in result["pred_strings"]]
-references = [x.upper() for x in result["sentence"]]
-
-print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
-print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
+1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`:
+
+```bash
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test
+```
+
+2. To evaluate on `speech-recognition-community-v2/dev_data`:
+
+```bash
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
 ```
-
-**Test Result**:
-
-In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-06-17). Note that the table may show results that differ from those already reported; this can be due to specifics of the other evaluation scripts used.
-
-| Model | WER | CER |
-| ------------- | ------------- | ------------- |
-| jonatasgrosman/wav2vec2-large-xlsr-53-english | **18.98%** | **8.29%** |
-| jonatasgrosman/wav2vec2-large-english | 21.53% | 9.66% |
-| facebook/wav2vec2-large-960h-lv60-self | 22.03% | 10.39% |
-| facebook/wav2vec2-large-960h-lv60 | 23.97% | 11.14% |
-| boris/xlsr-en-punctuation | 29.10% | 10.75% |
-| facebook/wav2vec2-large-960h | 32.79% | 16.03% |
-| facebook/wav2vec2-base-960h | 39.86% | 19.89% |
-| facebook/wav2vec2-base-100h | 51.06% | 25.06% |
-| elgeish/wav2vec2-large-lv60-timit-asr | 59.96% | 34.28% |
-| facebook/wav2vec2-base-10k-voxpopuli-ft-en | 66.41% | 36.76% |
-| elgeish/wav2vec2-base-timit-asr | 68.78% | 36.81% |
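The new `Test WER/CER (+LM)` rows in the model-index report beam-search decoding with the n-gram language model bundled in this repo, while the plain rows use greedy CTC decoding (what `eval.py --greedy` does). Below is a minimal sketch of the two decoding paths, not part of this commit, assuming `pyctcdecode` and `kenlm` are installed and using a hypothetical 16 kHz `sample.wav`:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)           # tokenizer + feature extractor
processor_lm = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)  # adds the pyctcdecode beam-search decoder

speech, _ = librosa.load("sample.wav", sr=16_000)  # placeholder audio file
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy CTC decoding (--greedy): argmax over the logits, then collapse repeats and blanks
greedy_text = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# LM-boosted decoding (the default): beam search over the raw logits with the n-gram LM
lm_text = processor_lm.batch_decode(logits.numpy()).text[0]

print("greedy:", greedy_text)
print("+LM:", lm_text)
```

The gap between the paired rows (e.g. 19.06 vs. 14.81 WER on the Common Voice test split) is the gain from the language model.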
eval.py ADDED
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+import argparse
+import re
+
+import torch
+from datasets import Audio, Dataset, load_dataset, load_metric
+from transformers import AutoConfig, AutoModelForCTC, AutoTokenizer, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM, pipeline
+
+
+def log_results(result: Dataset, args: argparse.Namespace):
+    """DO NOT CHANGE. This function computes and logs the result metrics."""
+
+    log_outputs = args.log_outputs
+    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+
+    # load metrics
+    wer = load_metric("wer")
+    cer = load_metric("cer")
+
+    # compute metrics
+    wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
+    cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
+
+    # print & log results
+    result_str = f"WER: {wer_result}\nCER: {cer_result}"
+    print(result_str)
+
+    with open(f"{dataset_id}_eval_results.txt", "w") as f:
+        f.write(result_str)
+
+    # log all results in text files, possibly interesting for analysis
+    if log_outputs:
+        pred_file = f"log_{dataset_id}_predictions.txt"
+        target_file = f"log_{dataset_id}_targets.txt"
+
+        with open(pred_file, "w") as p, open(target_file, "w") as t:
+
+            # mapping function to write output
+            def write_to_file(batch, i):
+                p.write(f"{i}\n")
+                p.write(batch["prediction"] + "\n")
+                t.write(f"{i}\n")
+                t.write(batch["target"] + "\n")
+
+            result.map(write_to_file, with_indices=True)
+
+
+def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
+    """DO ADAPT FOR YOUR USE CASE. This function normalizes the target text."""
+
+    text = text.lower() if to_lower else text.upper()
+    text = re.sub(invalid_chars_regex, " ", text)
+    text = re.sub(r"\s+", " ", text).strip()
+
+    return text
+
+
+def main(args):
+    # load dataset
+    dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
+
+    # for testing: only process the first ten examples
+    # dataset = dataset.select(range(10))
+
+    # load processor: with the bundled LM decoder unless greedy decoding was requested
+    if args.greedy:
+        processor = Wav2Vec2Processor.from_pretrained(args.model_id)
+        decoder = None
+    else:
+        processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
+        decoder = processor.decoder
+
+    feature_extractor = processor.feature_extractor
+    tokenizer = processor.tokenizer
+
+    # resample audio
+    dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
+
+    # load eval pipeline
+    if args.device is None:
+        args.device = 0 if torch.cuda.is_available() else -1
+
+    config = AutoConfig.from_pretrained(args.model_id)
+    model = AutoModelForCTC.from_pretrained(args.model_id)
+
+    asr = pipeline("automatic-speech-recognition", config=config, model=model, tokenizer=tokenizer,
+                   feature_extractor=feature_extractor, decoder=decoder, device=args.device)
+
+    # build normalizer config: any character outside the tokenizer vocabulary is invalid
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+    tokens = tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
+    special_tokens = [
+        tokenizer.pad_token, tokenizer.word_delimiter_token,
+        tokenizer.unk_token, tokenizer.bos_token,
+        tokenizer.eos_token,
+    ]
+    non_special_tokens = [x for x in tokens if x not in special_tokens]
+    invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
+    normalize_to_lower = any(token.isalpha() and token.islower() for token in non_special_tokens)
+
+    # map function to decode audio
+    def map_to_pred(batch, args=args, asr=asr, invalid_chars_regex=invalid_chars_regex, normalize_to_lower=normalize_to_lower):
+        prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
+
+        batch["prediction"] = prediction["text"]
+        batch["target"] = normalize_text(batch["sentence"], invalid_chars_regex, normalize_to_lower)
+        return batch
+
+    # run inference on all examples
+    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+
+    # filter out empty targets
+    result = result.filter(lambda example: example["target"] != "")
+
+    # compute and log results (do not change the function below)
+    log_results(result, args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
+    )
+    parser.add_argument(
+        "--dataset", type=str, required=True, help="Dataset name to evaluate the `model_id` on. Should be loadable with 🤗 Datasets"
+    )
+    parser.add_argument(
+        "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
+    )
+    parser.add_argument(
+        "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
+    )
+    parser.add_argument(
+        "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to None. For long audio files a good value is 5.0 seconds."
+    )
+    parser.add_argument(
+        "--stride_length_s", type=float, default=None, help="Stride of the audio chunks in seconds. Defaults to None. For long audio files a good value is 1.0 seconds."
+    )
+    parser.add_argument(
+        "--log_outputs", action="store_true", help="If set, write predictions and targets to log files for analysis."
+    )
+    parser.add_argument(
+        "--greedy", action="store_true", help="If set, the LM is ignored during inference (greedy CTC decoding)."
+    )
+    parser.add_argument(
+        "--device", type=int, default=None, help="The device to run the pipeline on. -1 for CPU, 0 for the first GPU, and so on. Defaults to GPU 0 if available, else CPU."
+    )
+    args = parser.parse_args()
+
+    main(args)
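Targets are normalized by `normalize_text` before scoring: the text is uppercased (or lowercased, depending on the casing of the tokenizer vocabulary), every character outside the vocabulary is replaced by a space, and whitespace is collapsed. A small self-contained illustration, using a stand-in regex in place of the vocabulary-derived one built above:

```python
import re

def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
    # same normalization steps as eval.py's normalize_text
    text = text.lower() if to_lower else text.upper()
    text = re.sub(invalid_chars_regex, " ", text)
    return re.sub(r"\s+", " ", text).strip()

# stand-in for the regex derived from an uppercase English vocabulary
invalid_chars_regex = r"[^\sA-Z']"

print(normalize_text("Hello, world... it's me!", invalid_chars_regex, to_lower=False))
# HELLO WORLD IT'S ME
```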
full_eval.sh ADDED
@@ -0,0 +1,15 @@
+# CV - TEST
+
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test --log_outputs --greedy
+mv log_mozilla-foundation_common_voice_6_0_en_test_predictions.txt log_mozilla-foundation_common_voice_6_0_en_test_predictions_greedy.txt
+mv mozilla-foundation_common_voice_6_0_en_test_eval_results.txt mozilla-foundation_common_voice_6_0_en_test_eval_results_greedy.txt
+
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test --log_outputs
+
+# HF EVENT - DEV
+
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs --greedy
+mv log_speech-recognition-community-v2_dev_data_en_validation_predictions.txt log_speech-recognition-community-v2_dev_data_en_validation_predictions_greedy.txt
+mv speech-recognition-community-v2_dev_data_en_validation_eval_results.txt speech-recognition-community-v2_dev_data_en_validation_eval_results_greedy.txt
+
+python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
log_mozilla-foundation_common_voice_6_0_en_test_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_en_test_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_en_test_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_en_validation_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_en_validation_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_en_validation_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
mozilla-foundation_common_voice_6_0_en_test_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+WER: 0.1481828839390387
+CER: 0.06848087313203592

mozilla-foundation_common_voice_6_0_en_test_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
+WER: 0.19067492882264278
+CER: 0.07694957927516068

speech-recognition-community-v2_dev_data_en_validation_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+WER: 0.2085057090848916
+CER: 0.11011805154105943

speech-recognition-community-v2_dev_data_en_validation_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
+WER: 0.27722157868608305
+CER: 0.11652265190008215
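For reference, the `*_eval_results.txt` files store WER/CER as fractions, while the model-index values in the README express the same numbers as percentages, apparently truncated (not rounded) to two decimals:

```python
import math

def to_percent(error_rate: float) -> float:
    # percent, truncated (not rounded) to two decimal places
    return math.floor(error_rate * 10_000) / 100

# +LM Common Voice test results from this commit
print(to_percent(0.1481828839390387))   # 14.81 -> "Test WER (+LM)"
print(to_percent(0.06848087313203592))  # 6.84  -> "Test CER (+LM)"
```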