---
language:
- eo
license: apache-2.0
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_13_0
- generated_from_trainer
metrics:
- wer
model-index:
- name: wav2vec2-common_voice_13_0-eo-3
  results: []
---

# wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer

This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) Esperanto dataset.
It achieves the following results on the evaluation set:

- Loss: 0.2191
- Cer: 0.0208
- Wer: 0.0687

The first 10 samples in the test set:

| Actual<br>Predicted | CER |
|:--------------------|:----|
| `la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`<br>`la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo` | 0.0 |
| `en la sekva jaro li ricevis premion`<br>`en la sekva jaro li ricevis prenion` | 0.02857142857142857 |
| `ŝi studis historion ĉe la universitato de brita kolumbio`<br>`ŝi studis historion ĉe la universitato de brita kolumbio` | 0.0 |
| `larĝaj ŝtupoj kuras al la fasado`<br>`larĝaj ŝtupoj kuras al la fasado` | 0.0 |
| `la municipo ĝuas duan epokon de etendo kaj disvolviĝo`<br>`la municipo ĝuas duonepokon de tendo kaj disvolviĝo` | 0.05660377358490566 |
| `li estis ankaŭ katedrestro kaj dekano`<br>`li estis ankaŭ katedresto kaj dekano` | 0.02702702702702703 |
| `librovendejo apartenas al la muzeo`<br>`librovendejo apartenas al la muzeo` | 0.0 |
| `ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj`<br>`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj` | 0.02702702702702703 |
| `unue ili estas ruĝaj poste brunaj`<br>`unue ili estas ruĝaj poste brunaj` | 0.0 |
| `la loĝantaro laboras en la proksima ĉefurbo`<br>`la loĝantaro laboras en la proksima ĉefurbo` | 0.0 |


## Model description

See [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).

## Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.

## Training and evaluation data

The training split was set to `train[:15000]` while the eval split was set to `validation[:1500]`.

## Training procedure

I used [`run_speech_recognition_ctc.py`](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition) with the following `train.json` file passed to it:

```json
{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}
```

I went through the dataset to find non-speech characters, and these were placed in `chars_to_ignore`. In addition, there were character sequences that could be transcribed to Esperanto phonemes, and these were placed as a dictionary in `chars_to_substitute`. This required adding such an argument to the program:

```py
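from dataclasses import dataclass, field
from typing import Dict, Optional


def dict_field(default=None, metadata=None):
    # Dataclasses reject mutable defaults, so a dict default goes through default_factory.
    return field(default_factory=lambda: default, metadata=metadata)


@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )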
```

Then I copied `remove_special_characters` to do the actual substitution:

```py
    def remove_special_characters(batch):
        text = batch[text_column_name]
        if chars_to_ignore_regex is not None:
            text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
        batch["target_text"] = text.lower() + " "
        return batch

    def substitute_characters(batch):
        text: str = batch["target_text"]
        if data_args.chars_to_substitute is not None:
            for k, v in data_args.chars_to_substitute.items():
                # str.replace returns a new string, so the result must be reassigned.
                text = text.replace(k, v)
        batch["target_text"] = text.lower()
        return batch

    with training_args.main_process_first(desc="dataset map special characters removal"):
        raw_datasets = raw_datasets.map(
            remove_special_characters,
            remove_columns=[text_column_name],
            desc="remove special characters from datasets",
        )

    with training_args.main_process_first(desc="dataset map special characters substitute"):
        raw_datasets = raw_datasets.map(
            substitute_characters,
            desc="substitute special characters in datasets",
        )
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- layerdrop: 0.1
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 100
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Cer    | Validation Loss | Wer    |
|:-------------:|:-----:|:-----:|:------:|:---------------:|:------:|
| 2.6416        | 2.13  | 1000  | 0.1541 | 0.8599          | 0.6449 |
| 0.2633        | 4.27  | 2000  | 0.0335 | 0.1897          | 0.1431 |
| 0.1739        | 6.4   | 3000  | 0.0289 | 0.1732          | 0.1145 |
| 0.1378        | 8.53  | 4000  | 0.0276 | 0.1729          | 0.1066 |
| 0.1172        | 10.67 | 5000  | 0.0268 | 0.1773          | 0.1019 |
| 0.1049        | 12.8  | 6000  | 0.0255 | 0.1701          | 0.0937 |
| 0.0951        | 14.93 | 7000  | 0.0253 | 0.1718          | 0.0933 |
| 0.0851        | 17.07 | 8000  | 0.0239 | 0.1787          | 0.0834 |
| 0.0809        | 19.2  | 9000  | 0.0235 | 0.1802          | 0.0835 |
| 0.0756        | 21.33 | 10000 | 0.0239 | 0.1784          | 0.0855 |
| 0.0708        | 23.47 | 11000 | 0.0235 | 0.1748          | 0.0824 |
| 0.0657        | 25.6  | 12000 | 0.0228 | 0.1830          | 0.0796 |
| 0.0605        | 27.73 | 13000 | 0.0230 | 0.1896          | 0.0798 |
| 0.0583        | 29.87 | 14000 | 0.0224 | 0.1889          | 0.0778 |
| 0.0608        | 32.0  | 15000 | 0.0223 | 0.1849          | 0.0757 |
| 0.0556        | 34.13 | 16000 | 0.0223 | 0.1872          | 0.0767 |
| 0.0534        | 36.27 | 17000 | 0.0221 | 0.1893          | 0.0751 |
| 0.0523        | 38.4  | 18000 | 0.0218 | 0.1925          | 0.0729 |
| 0.0494        | 40.53 | 19000 | 0.0221 | 0.1957          | 0.0745 |
| 0.0475        | 42.67 | 20000 | 0.0217 | 0.1961          | 0.0740 |
| 0.048         | 44.8  | 21000 | 0.0214 | 0.1957          | 0.0714 |
| 0.0459        | 46.93 | 22000 | 0.0215 | 0.1968          | 0.0717 |
| 0.0435        | 49.07 | 23000 | 0.0217 | 0.2008          | 0.0717 |
| 0.0428        | 51.2  | 24000 | 0.0212 | 0.1991          | 0.0696 |
| 0.0418        | 53.33 | 25000 | 0.0215 | 0.2034          | 0.0714 |
| 0.0404        | 55.47 | 26000 | 0.0210 | 0.2014          | 0.0684 |
| 0.0394        | 57.6  | 27000 | 0.0210 | 0.2050          | 0.0681 |
| 0.0399        | 59.73 | 28000 | 0.0211 | 0.2039          | 0.0700 |
| 0.0389        | 61.87 | 29000 | 0.0214 | 0.2091          | 0.0694 |
| 0.038         | 64.0  | 30000 | 0.0210 | 0.2100          | 0.0702 |
| 0.0361        | 66.13 | 31000 | 0.0215 | 0.2119          | 0.0703 |
| 0.0359        | 68.27 | 32000 | 0.0213 | 0.2108          | 0.0714 |
| 0.0354        | 70.4  | 33000 | 0.0211 | 0.2120          | 0.0699 |
| 0.0364        | 72.53 | 34000 | 0.0211 | 0.2128          | 0.0688 |
| 0.0361        | 74.67 | 35000 | 0.0212 | 0.2134          | 0.0694 |
| 0.0332        | 76.8  | 36000 | 0.0210 | 0.2176          | 0.0698 |
| 0.0341        | 78.93 | 37000 | 0.0208 | 0.2170          | 0.0688 |
| 0.032         | 81.07 | 38000 | 0.0209 | 0.2157          | 0.0686 |
| 0.0318        | 83.33 | 39000 | 0.0209 | 0.2166          | 0.0685 |
| 0.0325        | 85.47 | 40000 | 0.0209 | 0.2172          | 0.0687 |
| 0.0316        | 87.6  | 41000 | 0.0208 | 0.2181          | 0.0678 |
| 0.0302        | 89.73 | 42000 | 0.0208 | 0.2171          | 0.0679 |
| 0.0318        | 91.87 | 43000 | 0.0211 | 0.2179          | 0.0702 |
| 0.0314        | 94.0  | 44000 | 0.0208 | 0.2186          | 0.0690 |
| 0.0309        | 96.13 | 45000 | 0.0210 | 0.2193          | 0.0696 |
| 0.031         | 98.27 | 46000 | 0.0208 | 0.2191          | 0.0686 |

### Framework versions

- Transformers 4.29.1
- Pytorch 2.0.1+cu118
- Datasets 2.12.0
- Tokenizers 0.13.3

## Discussion

### Nans and Infs

While debugging other training sessions that used more of the Esperanto Common Voice dataset (some loss calculations were returning either `inf` or `nan`), I found that some training examples had surprisingly high CER when transcribed with this model. Some examples:

| file | Actual<br>---<br>Predicted | CER | Comment |
|:-----|:--------------------|:----|:--------|
|common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>---<br>a taaj keo eoj eejn kigos eegoj  eioeegiooj| 0.61 | No audio |
|common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>---<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
|common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>---<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
|2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>---<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
|7333 | poste sekvas difinoj de la termino<br>---<br>po | 0.94 | No audio |
|7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>---<br>po | 0.98 | No audio |
|7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>---<br>po | 0.97 | No audio |
|11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>---<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |

Several of these examples have no audio at all. Files like these are useless for training and should be removed from the training set.

You can see that the model hallucinates a transcription when there is little or no audio, which makes it unreliable for reporting what was actually said. Ideally the model would also report a confidence value, so that only transcriptions with relatively high confidence are kept. However, I haven't found a way to get such a value.
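
One common heuristic for a rough confidence value, which I have not tried on this model, is to derive it from the CTC output itself: apply a softmax to each frame's logits, take the top probability per frame, and average it over the frames whose best token is not the blank. The sketch below assumes the logits are available as a list of per-frame rows (for a `Wav2Vec2ForCTC` model they would come from the `logits` of the forward pass) and that the blank token has id 0. The function names and the blank id are illustrative assumptions, not part of the training script.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_ctc_confidence(
    logits: Sequence[Sequence[float]], blank_id: int = 0  # blank id is an assumption
) -> Tuple[List[int], float]:
    """Greedy CTC decode plus a rough confidence score.

    Returns the collapsed token ids and the mean top-1 probability over
    frames whose best token is not the blank. The confidence is 0.0 when
    every frame decodes to blank.
    """
    tokens: List[int] = []
    frame_probs: List[float] = []
    previous = blank_id
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != blank_id:
            frame_probs.append(probs[best])
            # CTC collapse: a repeated token only counts once unless a blank intervenes.
            if best != previous:
                tokens.append(best)
        previous = best
    confidence = sum(frame_probs) / len(frame_probs) if frame_probs else 0.0
    return tokens, confidence
```

A transcription could then be rejected when its confidence falls below a threshold chosen on the validation set. Whether the silent and distorted clips above would actually score low under this measure is untested.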

The Common Voice dataset also contains upvotes and downvotes. Of the high CER sentences above, all had 2 upvotes, with some having 0 downvotes, and some having 1. So we cannot rely on upvotes or downvotes to detect quality.

So what to do?

### Alternative 1

Despite these zero- and low-quality files, training seems to work OK. However, we still need to address when loss becomes `nan` or `inf` because that ruins the calculation.

I can check whether any losses are `nan` or `inf` by running `run_speech_recognition_ctc` with `do_train=false`, `model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3"`, and `eval_split_name` set to `test`, `validation`, or `train`, after modifying `trainer.py` as follows:

```py
        # To be JSON-serializable, we need to remove numpy types or zero-d tensors
        metrics = denumpify_detensorize(metrics)

        if all_losses is not None:
            # np.where returns a tuple, so its len() is never 0; use index arrays instead.
            loss_nan = np.flatnonzero(np.isnan(all_losses))
            if loss_nan.size != 0:
                print(f'LOSSES ARE NAN: {loss_nan}')
            loss_inf = np.flatnonzero(np.isinf(all_losses))
            if loss_inf.size != 0:
                print(f'LOSSES ARE INF: {loss_inf}')
            metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
```

Doing this shows that of the 14913 examples in `test`, the following example results in `inf` loss:

`common_voice_eo_25167318.mp3`

The audio in this file is severely garbled. It should absolutely be filtered out of the test set.

No `validation` samples result in `inf` or `nan`.

The following 18 out of 143984 examples in `train` result in `inf` loss:

```txt
common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3
```

Those files have no audio.
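
Once such files are known, dropping them before training is straightforward. The following is a minimal sketch, assuming each example is a dict with a `path` field holding the clip's file name or path. `BAD_FILES`, `is_usable`, and `drop_bad_examples` are hypothetical names, and the set lists only a few of the files above.

```python
import os
from typing import Dict, Iterable, List, Set

# A few of the file names reported above as producing `inf` loss.
BAD_FILES: Set[str] = {
    "common_voice_eo_25167318.mp3",  # test
    "common_voice_eo_25467641.mp3",  # train
    "common_voice_eo_25467723.mp3",  # train
}


def is_usable(example: Dict[str, str], bad_files: Set[str] = BAD_FILES) -> bool:
    """True when the example's audio file is not on the exclusion list."""
    return os.path.basename(example["path"]) not in bad_files


def drop_bad_examples(examples: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the examples whose audio file is not excluded."""
    return [ex for ex in examples if is_usable(ex)]
```

With the Hugging Face `datasets` library, the same predicate could presumably be passed to `Dataset.filter` instead of looping over a list.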

### Alternative 2

Another possibility is just to go through the audio files and throw away any where the peak audio isn't above some threshold.
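
A minimal sketch of such a check follows, assuming each clip has already been decoded to float samples normalised to the range [-1.0, 1.0]. The function names and the default threshold of 0.01 are assumptions for illustration; the threshold would need tuning against real clips.

```python
from typing import Sequence


def peak_amplitude(samples: Sequence[float]) -> float:
    """Largest absolute sample value; 0.0 for an empty clip."""
    return max((abs(s) for s in samples), default=0.0)


def is_probably_silent(samples: Sequence[float], threshold: float = 0.01) -> bool:
    """Flag clips whose peak never rises above the threshold.

    The default threshold is a guess, not a measured value.
    """
    return peak_amplitude(samples) <= threshold
```

A peak test like this would only catch the silent files. Loud but distorted recordings, or recordings of the wrong text, would pass it.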

### Alternative 3

Since this model seems to work well enough, I could run inference on all samples, and just discard the ones where the CER (as determined by this model) is too high, say above 0.5. Then use that to filter the examples and train another model. These high-CER examples are:

#### Test set

71 of 14913 examples in the test set show high CER.

```txt
common_voice_eo_25214319.mp3
common_voice_eo_25006596.mp3
common_voice_eo_27472721.mp3
common_voice_eo_27715088.mp3
common_voice_eo_27715091.mp3
common_voice_eo_26677019.mp3
common_voice_eo_26677023.mp3
common_voice_eo_20555291.mp3
common_voice_eo_25001942.mp3
common_voice_eo_25457354.mp3
common_voice_eo_25457355.mp3
common_voice_eo_25457365.mp3
common_voice_eo_25457373.mp3
common_voice_eo_25457396.mp3
common_voice_eo_25457397.mp3
common_voice_eo_25457409.mp3
common_voice_eo_25457410.mp3
common_voice_eo_25457412.mp3
common_voice_eo_25457442.mp3
common_voice_eo_25457444.mp3
common_voice_eo_25457445.mp3
common_voice_eo_25457577.mp3
common_voice_eo_25457578.mp3
common_voice_eo_28064453.mp3
common_voice_eo_25047803.mp3
common_voice_eo_25048418.mp3
common_voice_eo_25048419.mp3
common_voice_eo_25048421.mp3
common_voice_eo_25048423.mp3
common_voice_eo_25048428.mp3
common_voice_eo_25048574.mp3
common_voice_eo_25885643.mp3
common_voice_eo_25885645.mp3
common_voice_eo_26794882.mp3
common_voice_eo_27356529.mp3
common_voice_eo_25012640.mp3
common_voice_eo_25303457.mp3
common_voice_eo_18153931.mp3
common_voice_eo_18776206.mp3
common_voice_eo_18776208.mp3
common_voice_eo_18776219.mp3
common_voice_eo_18776220.mp3
common_voice_eo_18776222.mp3
common_voice_eo_18776223.mp3
common_voice_eo_18776236.mp3
common_voice_eo_18776238.mp3
common_voice_eo_18776244.mp3
common_voice_eo_18776248.mp3
common_voice_eo_18776285.mp3
common_voice_eo_18776287.mp3
common_voice_eo_18776297.mp3
common_voice_eo_18776298.mp3
common_voice_eo_25047998.mp3
common_voice_eo_25047999.mp3
common_voice_eo_25048000.mp3
common_voice_eo_25048001.mp3
common_voice_eo_25048002.mp3
common_voice_eo_25053113.mp3
common_voice_eo_25068355.mp3
common_voice_eo_25333056.mp3
common_voice_eo_25371639.mp3
common_voice_eo_25371640.mp3
common_voice_eo_25371641.mp3
common_voice_eo_25371642.mp3
common_voice_eo_25371643.mp3
common_voice_eo_22441946.mp3
common_voice_eo_26622121.mp3
common_voice_eo_25167318.mp3
common_voice_eo_25252685.mp3
common_voice_eo_25252698.mp3
common_voice_eo_25518636.mp3
```

Note on two of the examples: We know that _saluton kiel vi fartas_ ("Hello, how are you") and _atendu momenton_ ("Wait a moment") are a good start in learning Esperanto, but if that's not the text to record, you're not really helping.

#### Validation set

17 of 14909 examples in the validation set show high CER.

```txt
common_voice_eo_25392669.mp3
common_voice_eo_25392674.mp3
common_voice_eo_25392675.mp3
common_voice_eo_25392676.mp3
common_voice_eo_25392678.mp3
common_voice_eo_25392693.mp3
common_voice_eo_25392694.mp3
common_voice_eo_25392695.mp3
common_voice_eo_25392697.mp3
common_voice_eo_25392701.mp3
common_voice_eo_25392702.mp3
common_voice_eo_25392708.mp3
common_voice_eo_25392709.mp3
common_voice_eo_25408881.mp3
common_voice_eo_25408882.mp3
common_voice_eo_25408885.mp3
common_voice_eo_27380623.mp3
```

I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.

#### Training set

135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.

```txt
common_voice_eo_25365027.mp3
common_voice_eo_25365472.mp3
common_voice_eo_25365480.mp3
common_voice_eo_25365532.mp3
common_voice_eo_25365695.mp3
common_voice_eo_25365744.mp3
common_voice_eo_25365804.mp3
common_voice_eo_25365836.mp3
common_voice_eo_25365855.mp3
common_voice_eo_25372587.mp3
common_voice_eo_25401060.mp3
common_voice_eo_25430837.mp3
common_voice_eo_25444509.mp3
common_voice_eo_25240777.mp3
common_voice_eo_24942754.mp3
common_voice_eo_24942755.mp3
common_voice_eo_24990372.mp3
common_voice_eo_24990385.mp3
common_voice_eo_24990390.mp3
common_voice_eo_24990397.mp3
common_voice_eo_24990413.mp3
common_voice_eo_24990427.mp3
common_voice_eo_24990429.mp3
common_voice_eo_24990435.mp3
common_voice_eo_24990441.mp3
common_voice_eo_24990454.mp3
common_voice_eo_24990457.mp3
common_voice_eo_24990459.mp3
common_voice_eo_24990490.mp3
common_voice_eo_25529345.mp3
common_voice_eo_25648750.mp3
common_voice_eo_28670472.mp3
common_voice_eo_27931966.mp3
common_voice_eo_28252265.mp3
common_voice_eo_25454951.mp3
common_voice_eo_25927616.mp3
common_voice_eo_25153203.mp3
common_voice_eo_25238543.mp3
common_voice_eo_25284237.mp3
common_voice_eo_25460131.mp3
common_voice_eo_25460185.mp3
common_voice_eo_25460186.mp3
common_voice_eo_25460188.mp3
common_voice_eo_25460189.mp3
common_voice_eo_25446723.mp3
common_voice_eo_26025150.mp3
common_voice_eo_26640189.mp3
common_voice_eo_26888468.mp3
common_voice_eo_24844824.mp3
common_voice_eo_25022506.mp3
common_voice_eo_25022507.mp3
common_voice_eo_25022516.mp3
common_voice_eo_25032858.mp3
common_voice_eo_25032859.mp3
common_voice_eo_25032865.mp3
common_voice_eo_25243988.mp3
common_voice_eo_25244009.mp3
common_voice_eo_25266094.mp3
common_voice_eo_25266141.mp3
common_voice_eo_25285278.mp3
common_voice_eo_25286768.mp3
common_voice_eo_25457171.mp3
common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3
common_voice_eo_25975149.mp3
common_voice_eo_26193748.mp3
common_voice_eo_28401039.mp3
common_voice_eo_28421315.mp3
common_voice_eo_28937347.mp3
common_voice_eo_24890414.mp3
common_voice_eo_25294479.mp3
common_voice_eo_25438966.mp3
common_voice_eo_28855568.mp3
common_voice_eo_29011007.mp3
common_voice_eo_24599888.mp3
common_voice_eo_26964252.mp3
common_voice_eo_26964496.mp3
common_voice_eo_26964510.mp3
common_voice_eo_25432789.mp3
common_voice_eo_26688158.mp3
common_voice_eo_28516354.mp3
common_voice_eo_24790865.mp3
common_voice_eo_24790897.mp3
common_voice_eo_24790898.mp3
common_voice_eo_24790899.mp3
common_voice_eo_24790900.mp3
common_voice_eo_25362713.mp3
common_voice_eo_27585084.mp3
common_voice_eo_24813131.mp3
common_voice_eo_25035262.mp3
common_voice_eo_26000289.mp3
common_voice_eo_26003943.mp3
common_voice_eo_26283983.mp3
common_voice_eo_28708931.mp3
common_voice_eo_28037217.mp3
common_voice_eo_29273106.mp3
common_voice_eo_26006657.mp3
common_voice_eo_25399924.mp3
common_voice_eo_27982431.mp3
common_voice_eo_25893779.mp3
common_voice_eo_27842061.mp3
common_voice_eo_25052385.mp3
common_voice_eo_25807395.mp3
common_voice_eo_25807985.mp3
common_voice_eo_25808039.mp3
common_voice_eo_25808407.mp3
common_voice_eo_25809036.mp3
common_voice_eo_27487795.mp3
common_voice_eo_28460556.mp3
common_voice_eo_28884851.mp3
common_voice_eo_24819719.mp3
common_voice_eo_25153594.mp3
common_voice_eo_25234585.mp3
common_voice_eo_25245164.mp3
common_voice_eo_27538877.mp3
common_voice_eo_24862771.mp3
common_voice_eo_25070167.mp3
common_voice_eo_26381720.mp3
common_voice_eo_28110376.mp3
```
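
The CER-based filtering described under Alternative 3 can be sketched as follows. This is a plain-Python stand-in for whatever CER implementation is actually used: it computes the character-level Levenshtein distance divided by the reference length, then splits examples at a cutoff. `split_by_cer` and the tuple layout `(file name, reference, prediction)` are hypothetical.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance divided by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def split_by_cer(
    examples: Iterable[Tuple[str, str, str]], max_cer: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split (file name, reference, prediction) triples into keep and drop lists."""
    keep: List[str] = []
    drop: List[str] = []
    for name, ref, hyp in examples:
        (drop if cer(ref, hyp) > max_cer else keep).append(name)
    return keep, drop
```

As noted above, a high CER does not always mean a bad recording, so the drop list is best treated as a set of candidates to listen to rather than as an automatic exclusion list.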

### Alternative 3.1

Of those files that have no or distorted audio, maybe change their target to be empty? Except for 'injabum'.

### And also

Since one can sign up at Common Voice to review Esperanto audio files, I've done so in the hope of making a small contribution to the dataset's quality.