jonatasgrosman committed on
Commit
877efb7
1 Parent(s): 77c5e23

update evaluation

README.md CHANGED
@@ -2,20 +2,24 @@
2
  language: ru
3
  datasets:
4
  - common_voice
 
5
  metrics:
6
  - wer
7
  - cer
8
  tags:
 
9
  - audio
10
  - automatic-speech-recognition
11
  - speech
12
  - xlsr-fine-tuning-week
 
 
13
  license: apache-2.0
14
  model-index:
15
  - name: XLSR Wav2Vec2 Russian by Jonatas Grosman
16
  results:
17
  - task:
18
- name: Speech Recognition
19
  type: automatic-speech-recognition
20
  dataset:
21
  name: Common Voice ru
@@ -24,11 +28,36 @@ model-index:
24
  metrics:
25
  - name: Test WER
26
  type: wer
27
- value: 13.38
28
  - name: Test CER
29
  type: cer
30
- value: 2.86
31
-
32
  ---
33
 
34
  # Wav2Vec2-Large-XLSR-53-Russian
@@ -110,74 +139,14 @@ for i, predicted_sentence in enumerate(predicted_sentences):
110
 
111
  ## Evaluation
112
 
113
- The model can be evaluated as follows on the Russian test data of Common Voice.
114
-
115
- ```python
116
- import torch
117
- import re
118
- import librosa
119
- from datasets import load_dataset, load_metric
120
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
121
-
122
- LANG_ID = "ru"
123
- MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
124
- DEVICE = "cuda"
125
-
126
- CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
127
- "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
128
- "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
129
- "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
130
- "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
131
-
132
- test_dataset = load_dataset("common_voice", LANG_ID, split="test")
133
 
134
- wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
135
- cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
136
-
137
- chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
138
-
139
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
140
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
141
- model.to(DEVICE)
142
-
143
- # Preprocessing the datasets.
144
- # We need to read the audio files as arrays
145
- def speech_file_to_array_fn(batch):
146
- with warnings.catch_warnings():
147
- warnings.simplefilter("ignore")
148
- speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
149
- batch["speech"] = speech_array
150
- batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
151
- return batch
152
-
153
- test_dataset = test_dataset.map(speech_file_to_array_fn)
154
-
155
- # Preprocessing the datasets.
156
- # We need to read the audio files as arrays
157
- def evaluate(batch):
158
- inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
159
-
160
- with torch.no_grad():
161
- logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
162
-
163
- pred_ids = torch.argmax(logits, dim=-1)
164
- batch["pred_strings"] = processor.batch_decode(pred_ids)
165
- return batch
166
-
167
- result = test_dataset.map(evaluate, batched=True, batch_size=8)
168
-
169
- predictions = [x.upper() for x in result["pred_strings"]]
170
- references = [x.upper() for x in result["sentence"]]
171
-
172
- print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
173
- print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
174
  ```
175
 
176
- **Test Result**:
177
 
178
- In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-22). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.
179
-
180
- | Model | WER | CER |
181
- | ------------- | ------------- | ------------- |
182
- | jonatasgrosman/wav2vec2-large-xlsr-53-russian | **13.38%** | **2.86%** |
183
- | anton-l/wav2vec2-large-xlsr-53-russian | 19.49% | 4.15% |
 
2
  language: ru
3
  datasets:
4
  - common_voice
5
+ - mozilla-foundation/common_voice_6_0
6
  metrics:
7
  - wer
8
  - cer
9
  tags:
10
+ - ru
11
  - audio
12
  - automatic-speech-recognition
13
  - speech
14
  - xlsr-fine-tuning-week
15
+ - robust-speech-event
16
+ - mozilla-foundation/common_voice_6_0
17
  license: apache-2.0
18
  model-index:
19
  - name: XLSR Wav2Vec2 Russian by Jonatas Grosman
20
  results:
21
  - task:
22
+ name: Automatic Speech Recognition
23
  type: automatic-speech-recognition
24
  dataset:
25
  name: Common Voice ru
 
28
  metrics:
29
  - name: Test WER
30
  type: wer
31
+ value: 13.30
32
  - name: Test CER
33
  type: cer
34
+ value: 2.88
35
+ - name: Test WER (+LM)
36
+ type: wer
37
+ value: 9.57
38
+ - name: Test CER (+LM)
39
+ type: cer
40
+ value: 2.24
41
+ - task:
42
+ name: Automatic Speech Recognition
43
+ type: automatic-speech-recognition
44
+ dataset:
45
+ name: Robust Speech Event - Dev Data
46
+ type: speech-recognition-community-v2/dev_data
47
+ args: ru
48
+ metrics:
49
+ - name: Test WER
50
+ type: wer
51
+ value: 40.22
52
+ - name: Test CER
53
+ type: cer
54
+ value: 14.80
55
+ - name: Test WER (+LM)
56
+ type: wer
57
+ value: 33.61
58
+ - name: Test CER (+LM)
59
+ type: cer
60
+ value: 13.50
61
  ---
62
 
63
  # Wav2Vec2-Large-XLSR-53-Russian
 
139
 
140
  ## Evaluation
141
 
142
+ 1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
143
 
144
+ ```bash
145
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset mozilla-foundation/common_voice_6_0 --config ru --split test
146
  ```
147
 
148
+ 2. To evaluate on `speech-recognition-community-v2/dev_data`
149
 
150
+ ```bash
151
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset speech-recognition-community-v2/dev_data --config ru --split validation --chunk_length_s 5.0 --stride_length_s 1.0
152
+ ```
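The WER and CER values reported in the model card metadata are edit-distance error rates. As a rough illustration only (the evaluation script relies on `load_metric("wer")` and `load_metric("cer")`, whose implementations may differ in detail), the sketch below computes a corpus-level error rate from a plain Levenshtein distance. The helper names `edit_distance` and `error_rate` are hypothetical and do not appear in this commit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit="word"):
    """Total edit distance divided by total reference length, over the whole corpus."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(p))
                 for r, p in zip(references, predictions))
    total = sum(len(split(r)) for r in references)
    return errors / total


# One substitution (b -> x) and one deletion (d) against four reference words.
print(error_rate(["a b c d"], ["a x c"]))          # 0.5
# Character-level variant, analogous to CER.
print(error_rate(["abc"], ["abd"], unit="char"))
```

Under this definition the rate is a fraction between 0 and (potentially above) 1, which is the scale used in the `*_eval_results.txt` files; the model card lists the same quantities as percentages.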
 
 
 
eval.py ADDED
@@ -0,0 +1,164 @@
1
+ #!/usr/bin/env python3
2
+ from datasets import load_dataset, load_metric, Audio, Dataset
3
+ from transformers import pipeline, AutoFeatureExtractor, AutoTokenizer, AutoConfig, AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
4
+ import re
5
+ import torch
6
+ import argparse
7
+ from typing import Dict
8
+
9
+ def log_results(result: Dataset, args: argparse.Namespace):
10
+ """ DO NOT CHANGE. This function computes and logs the result metrics. """
11
+
12
+ log_outputs = args.log_outputs
13
+ dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
14
+
15
+ # load metric
16
+ wer = load_metric("wer")
17
+ cer = load_metric("cer")
18
+
19
+ # compute metrics
20
+ wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
21
+ cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
22
+
23
+ # print & log results
24
+ result_str = (
25
+ f"WER: {wer_result}\n"
26
+ f"CER: {cer_result}"
27
+ )
28
+ print(result_str)
29
+
30
+ with open(f"{dataset_id}_eval_results.txt", "w") as f:
31
+ f.write(result_str)
32
+
33
+ # log all results in text file. Possibly interesting for analysis
34
+ if log_outputs:
35
+ pred_file = f"log_{dataset_id}_predictions.txt"
36
+ target_file = f"log_{dataset_id}_targets.txt"
37
+
38
+ with open(pred_file, "w") as p, open(target_file, "w") as t:
39
+
40
+ # mapping function to write output
41
+ def write_to_file(batch, i):
42
+ p.write(f"{i}" + "\n")
43
+ p.write(batch["prediction"] + "\n")
44
+ t.write(f"{i}" + "\n")
45
+ t.write(batch["target"] + "\n")
46
+
47
+ result.map(write_to_file, with_indices=True)
48
+
49
+
50
+ def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
51
+ """ DO ADAPT FOR YOUR USE CASE. this function normalizes the target text. """
52
+
53
+ text = text.lower() if to_lower else text.upper()
54
+
55
+ text = re.sub(invalid_chars_regex, " ", text)
56
+
57
+ text = re.sub(r"\s+", " ", text).strip()
58
+
59
+ return text
60
+
61
+
62
+ def main(args):
63
+ # load dataset
64
+ dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
65
+
66
+ # for testing: only process the first ten examples
67
+ # dataset = dataset.select(range(10))
68
+
69
+ # load processor
70
+ if args.greedy:
71
+ processor = Wav2Vec2Processor.from_pretrained(args.model_id)
72
+ decoder = None
73
+ else:
74
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained(args.model_id)
75
+ decoder = processor.decoder
76
+
77
+ feature_extractor = processor.feature_extractor
78
+ tokenizer = processor.tokenizer
79
+
80
+ # resample audio
81
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
82
+
83
+ # load eval pipeline
84
+ if args.device is None:
85
+ args.device = 0 if torch.cuda.is_available() else -1
86
+
87
+ config = AutoConfig.from_pretrained(args.model_id)
88
+ model = AutoModelForCTC.from_pretrained(args.model_id)
89
+
90
+ #asr = pipeline("automatic-speech-recognition", model=args.model_id, device=args.device)
91
+ asr = pipeline("automatic-speech-recognition", config=config, model=model, tokenizer=tokenizer,
92
+ feature_extractor=feature_extractor, decoder=decoder, device=args.device)
93
+
94
+ # build normalizer config
95
+ tokenizer = AutoTokenizer.from_pretrained(args.model_id)
96
+ tokens = [x for x in tokenizer.convert_ids_to_tokens(range(0, tokenizer.vocab_size))]
97
+ special_tokens = [
98
+ tokenizer.pad_token, tokenizer.word_delimiter_token,
99
+ tokenizer.unk_token, tokenizer.bos_token,
100
+ tokenizer.eos_token,
101
+ ]
102
+ non_special_tokens = [x for x in tokens if x not in special_tokens]
103
+ invalid_chars_regex = rf"[^\s{re.escape(''.join(set(non_special_tokens)))}]"
104
+ normalize_to_lower = False
105
+ for token in non_special_tokens:
106
+ if token.isalpha() and token.islower():
107
+ normalize_to_lower = True
108
+ break
109
+
110
+ # map function to decode audio
111
+ def map_to_pred(batch, args=args, asr=asr, invalid_chars_regex=invalid_chars_regex, normalize_to_lower=normalize_to_lower):
112
+ prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
113
+
114
+ batch["prediction"] = prediction["text"]
115
+ batch["target"] = normalize_text(batch["sentence"], invalid_chars_regex, normalize_to_lower)
116
+ return batch
117
+
118
+ # run inference on all examples
119
+ result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
120
+
121
+ # filtering out empty targets
122
+ result = result.filter(lambda example: example["target"] != "")
123
+
124
+ # compute and log_results
125
+ # do not change function below
126
+ log_results(result, args)
127
+
128
+
129
+ if __name__ == "__main__":
130
+ parser = argparse.ArgumentParser()
131
+
132
+ parser.add_argument(
133
+ "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
134
+ )
135
+ parser.add_argument(
136
+ "--dataset", type=str, required=True, help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets"
137
+ )
138
+ parser.add_argument(
139
+ "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
140
+ )
141
+ parser.add_argument(
142
+ "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
143
+ )
144
+ parser.add_argument(
145
+ "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to None. For long audio files a good value would be 5.0 seconds."
146
+ )
147
+ parser.add_argument(
148
+ "--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to None. For long audio files a good value would be 1.0 seconds."
149
+ )
150
+ parser.add_argument(
151
+ "--log_outputs", action='store_true', help="If defined, write outputs to log file for analysis."
152
+ )
153
+ parser.add_argument(
154
+ "--greedy", action='store_true', help="If defined, the LM will be ignored during inference."
155
+ )
156
+ parser.add_argument(
157
+ "--device",
158
+ type=int,
159
+ default=None,
160
+ help="The device to run the pipeline on. -1 for CPU, 0 for the first GPU and so on. Defaults to the first GPU if available, otherwise CPU.",
161
+ )
162
+ args = parser.parse_args()
163
+
164
+ main(args)
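To make the target normalization in `eval.py` concrete, here is a minimal standalone sketch of the same three steps (case folding, replacing out-of-vocabulary characters with spaces, collapsing whitespace). The vocabulary below is a hypothetical lowercase Cyrillic alphabet chosen for illustration; the real script derives it from the model's tokenizer, and the actual token set may differ.

```python
import re


def normalize_text(text: str, invalid_chars_regex: str, to_lower: bool) -> str:
    """Case-fold, replace characters outside the vocabulary with spaces, collapse whitespace."""
    text = text.lower() if to_lower else text.upper()
    text = re.sub(invalid_chars_regex, " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical vocabulary (assumption): lowercase Russian letters only.
vocab = list("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

# Anything that is neither whitespace nor a vocabulary character is invalid.
invalid_chars_regex = rf"[^\s{re.escape(''.join(vocab))}]"

# Normalize to lowercase if the vocabulary contains lowercase letters.
to_lower = any(token.isalpha() and token.islower() for token in vocab)

print(normalize_text("Привет,  мир!", invalid_chars_regex, to_lower))  # привет мир
```

Because only the targets are normalized this way, any character the model cannot emit is removed from the reference before scoring, so punctuation in the transcripts does not count as an error.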
full_eval.sh ADDED
@@ -0,0 +1,15 @@
1
+ # CV - TEST
2
+
3
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset mozilla-foundation/common_voice_6_0 --config ru --split test --log_outputs --greedy
4
+ mv log_mozilla-foundation_common_voice_6_0_ru_test_predictions.txt log_mozilla-foundation_common_voice_6_0_ru_test_predictions_greedy.txt
5
+ mv mozilla-foundation_common_voice_6_0_ru_test_eval_results.txt mozilla-foundation_common_voice_6_0_ru_test_eval_results_greedy.txt
6
+
7
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset mozilla-foundation/common_voice_6_0 --config ru --split test --log_outputs
8
+
9
+ # HF EVENT - DEV
10
+
11
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset speech-recognition-community-v2/dev_data --config ru --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs --greedy
12
+ mv log_speech-recognition-community-v2_dev_data_ru_validation_predictions.txt log_speech-recognition-community-v2_dev_data_ru_validation_predictions_greedy.txt
13
+ mv speech-recognition-community-v2_dev_data_ru_validation_eval_results.txt speech-recognition-community-v2_dev_data_ru_validation_eval_results_greedy.txt
14
+
15
+ python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-russian --dataset speech-recognition-community-v2/dev_data --config ru --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
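The `mv` commands in `full_eval.sh` follow from how `log_results` names its output files: the names depend only on the dataset, config and split, so the second (LM) run would otherwise overwrite the greedy results. A small sketch of that naming scheme, using a hypothetical helper `output_paths` that mirrors the `dataset_id` expression in `eval.py`:

```python
def output_paths(dataset: str, config: str, split: str) -> dict:
    """Reproduce the file names written by log_results for one evaluation run."""
    dataset_id = "_".join(dataset.split("/") + [config, split])
    return {
        "results": f"{dataset_id}_eval_results.txt",
        "predictions": f"log_{dataset_id}_predictions.txt",
        "targets": f"log_{dataset_id}_targets.txt",
    }


paths = output_paths("mozilla-foundation/common_voice_6_0", "ru", "test")
for kind, name in paths.items():
    print(f"{kind}: {name}")
```

The targets file is not renamed in the script, which is consistent with the references being identical for the greedy and LM runs.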
log_mozilla-foundation_common_voice_6_0_ru_test_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_ru_test_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_6_0_ru_test_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_ru_validation_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_ru_validation_predictions_greedy.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_speech-recognition-community-v2_dev_data_ru_validation_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
mozilla-foundation_common_voice_6_0_ru_test_eval_results.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ WER: 0.09577627565075995
2
+ CER: 0.022471409641103304
mozilla-foundation_common_voice_6_0_ru_test_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ WER: 0.1330815852068859
2
+ CER: 0.028824204091177356
speech-recognition-community-v2_dev_data_ru_validation_eval_results.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ WER: 0.3361409730688241
2
+ CER: 0.13507897295031526
speech-recognition-community-v2_dev_data_ru_validation_eval_results_greedy.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ WER: 0.4022498060512025
2
+ CER: 0.14809992240941075
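The result files above store error rates as fractions, while the model card lists percentages. The sketch below converts the WER fractions and computes the relative reduction obtained with the language model. The WER percentages in the model card appear to match these fractions truncated (not rounded) to two decimals, so the hypothetical helper `truncate_pct` truncates; `relative_reduction` is likewise a name introduced here for illustration.

```python
import math

# WER fractions copied from the *_eval_results.txt files in this commit.
wer = {
    "common_voice_6_0 test": {"greedy": 0.1330815852068859, "lm": 0.09577627565075995},
    "dev_data validation": {"greedy": 0.4022498060512025, "lm": 0.3361409730688241},
}


def truncate_pct(fraction: float, decimals: int = 2) -> float:
    """Convert a fraction to a percentage, truncated to the given number of decimals."""
    scale = 10 ** decimals
    return math.floor(fraction * 100 * scale) / scale


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


for name, scores in wer.items():
    greedy, lm = scores["greedy"], scores["lm"]
    print(
        f"{name}: greedy {truncate_pct(greedy):.2f}%, "
        f"+LM {truncate_pct(lm):.2f}%, "
        f"relative reduction {relative_reduction(greedy, lm) * 100:.1f}%"
    )
```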