xekri committed on
Commit
bfa2a0a
1 Parent(s): 7261829

Updates README with discussion

  1. README.md +340 -0
README.md CHANGED
@@ -224,3 +224,343 @@ The following hyperparameters were used during training:
224
  - Pytorch 2.0.1+cu118
225
  - Datasets 2.12.0
226
  - Tokenizers 0.13.3
227
+
228
+ ## Discussion
229
+
230
+ ### Nans and Infs
231
+
232
+ While debugging other training sessions that used more data from the Esperanto Common Voice dataset (some loss calculations were returning either `inf` or `nan`), I found that some training-set examples had a surprisingly high CER when transcribed by this model. Some examples:
233
+
234
+ | file | Actual<br>Predicted | CER | Comment |
235
+ |:-----|:--------------------|:----|:--------|
236
+ |common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
237
+ |common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
238
+ |common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
239
+ |2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
240
+ |7333 | poste sekvas difinoj de la termino<br>po | 0.94 | No audio |
241
+ |7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>po | 0.98 | No audio |
242
+ |7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>po | 0.97 | No audio |
243
+ |11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
244
+
245
+ Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
246
+
247
+ You can see that the model is trying to hallucinate the target when there's little or no audio. This is terrible for realistically reporting what was said. I'd also hope that there is some measure of certainty, and maybe only go with transcriptions that have relatively high certainty. However, I can't find how to get at a certainty value.
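One possible proxy for a certainty value, which this README has not validated, is the average top-1 softmax probability of the CTC output over the frames where the model predicts a non-blank token. The sketch below is a minimal illustration under stated assumptions: `ctc_confidence` is a hypothetical helper, the logits are taken to be a `(frames, vocab_size)` array for one utterance, and `blank_id=0` is an assumed index for the CTC blank (pad) token that should be checked against the actual tokenizer.

```python
import numpy as np


def ctc_confidence(logits: np.ndarray, blank_id: int = 0) -> float:
    """Rough confidence proxy: mean top-1 probability over non-blank frames.

    logits:   array of shape (frames, vocab_size) for a single utterance.
    blank_id: index of the CTC blank/pad token (assumed to be 0 here).
    """
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    best_ids = probs.argmax(axis=-1)
    best_probs = probs.max(axis=-1)

    # Skip frames where the most likely token is the blank.
    keep = best_ids != blank_id
    if not keep.any():
        return 0.0  # only blanks predicted: treat as no confident transcription
    return float(best_probs[keep].mean())
```

This is not a calibrated probability, so any cut-off for "relatively high certainty" would have to be chosen empirically, for example by comparing the score against CER on the examples in the table above.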
248
+
249
+ The Common Voice dataset also contains upvotes and downvotes. Of the high CER sentences above, all had 2 upvotes, with some having 0 downvotes, and some having 1. So we cannot rely on upvotes or downvotes to detect quality.
250
+
251
+ So what to do?
252
+
253
+ ### Alternative 1
254
+
255
+ Despite these zero- and low-quality files, training seems to work OK. However, we still need to address when loss becomes `nan` or `inf` because that ruins the calculation.
256
+
257
+ By running `run_speech_recognition_ctc` with `do_train=false`, setting `model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3"`, setting `eval_split_name` to either `test`, `validation`, or `train`, and also modifying `trainer.py` as follows, I can check if any losses are nan or inf:
258
+
259
+ ```py
+ # To be JSON-serializable, we need to remove numpy types or zero-d tensors
+ metrics = denumpify_detensorize(metrics)
+
+ if all_losses is not None:
+     # np.where() returns a tuple of index arrays, so len() of its result is
+     # never 0 for a 1-D input. Use flat index arrays and check .size instead.
+     loss_nan = np.flatnonzero(np.isnan(all_losses))
+     if loss_nan.size != 0:
+         print(f'LOSSES ARE NAN: {loss_nan}')
+     loss_inf = np.flatnonzero(np.isinf(all_losses))
+     if loss_inf.size != 0:
+         print(f'LOSSES ARE INF: {loss_inf}')
+     metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
+ ```
272
+
273
+ Doing this shows that of the 14913 examples in `test`, the following file results in `inf` loss:
274
+
275
+ `common_voice_eo_25167318.mp3`
276
+
277
+ The audio in this file is severely garbled. It should absolutely be filtered out of the test set.
278
+
279
+ No `validation` samples result in `inf` or `nan`.
280
+
281
+ The following files out of the 143984 examples in `train` result in `inf` loss:
282
+
283
+ ```txt
284
+ common_voice_eo_25467641.mp3
285
+ common_voice_eo_25467723.mp3
286
+ common_voice_eo_25467791.mp3
287
+ common_voice_eo_25467820.mp3
288
+ common_voice_eo_25467943.mp3
289
+ common_voice_eo_25478612.mp3
290
+ common_voice_eo_25478623.mp3
291
+ common_voice_eo_25478631.mp3
292
+ common_voice_eo_25478756.mp3
293
+ common_voice_eo_25478762.mp3
294
+ common_voice_eo_25478768.mp3
295
+ common_voice_eo_25478769.mp3
296
+ common_voice_eo_25479150.mp3
297
+ common_voice_eo_25479203.mp3
298
+ common_voice_eo_25479229.mp3
299
+ common_voice_eo_25517673.mp3
300
+ common_voice_eo_25517677.mp3
301
+ common_voice_eo_25527739.mp3
302
+ ```
303
+
304
+ Those files have no audio.
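Excluding such files can be expressed as a simple predicate on the file name. The following is a sketch, not the procedure used for this model: `BAD_FILES` and `keep_example` are hypothetical names, the set holds only two of the files listed above for brevity, and each example is assumed to be a dict with a `path` field pointing at the clip. The second path in the demo list is made up.

```python
import os

# Hypothetical exclusion set; in practice it would hold every file listed above.
BAD_FILES = {
    "common_voice_eo_25467641.mp3",
    "common_voice_eo_25467723.mp3",
}


def keep_example(example: dict, bad_files: frozenset = frozenset(BAD_FILES)) -> bool:
    """Return True unless the example's audio file is in the exclusion set."""
    return os.path.basename(example["path"]) not in bad_files


# Toy data: the first path is one of the silent files, the second is invented.
examples = [
    {"path": "clips/common_voice_eo_25467641.mp3"},
    {"path": "clips/some_other_clip.mp3"},
]
kept = [ex for ex in examples if keep_example(ex)]
```

With the `datasets` library, a predicate like this can be passed to `Dataset.filter`, assuming the split exposes a `path` column.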
305
+
306
+ ### Alternative 2
307
+
308
+ Another possibility is just to go through the audio files and throw away any where the peak audio isn't above some threshold.
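A peak-amplitude check along these lines could look like the sketch below. It assumes the clip has already been decoded to a floating-point waveform scaled to [-1, 1]. The names `peak_amplitude` and `has_audible_signal` are hypothetical, and the default threshold of 0.01 is an arbitrary placeholder, not a value tested on this dataset.

```python
import numpy as np


def peak_amplitude(samples: np.ndarray) -> float:
    """Largest absolute sample value; 0.0 for an empty clip."""
    if samples.size == 0:
        return 0.0
    return float(np.max(np.abs(samples)))


def has_audible_signal(samples: np.ndarray, threshold: float = 0.01) -> bool:
    """True if the clip's peak exceeds the (assumed) silence threshold."""
    return peak_amplitude(samples) > threshold
```

A peak check only catches silent or near-silent clips. Distorted recordings and off-script recordings such as the 'injabum' one can be loud, so they would pass this filter.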
309
+
310
+ ### Alternative 3
311
+
312
+ Since this model seems to work well enough, I could run inference on all samples, and just discard the ones where the CER (as determined by this model) is too high, say above 0.5. Then use that to filter the examples and train another model. These high-CER examples are:
313
+
314
+ #### Test set
315
+
316
+ ```txt
317
+ common_voice_eo_25214319.mp3
318
+ common_voice_eo_25006596.mp3
319
+ common_voice_eo_27472721.mp3
320
+ common_voice_eo_27715088.mp3
321
+ common_voice_eo_27715091.mp3
322
+ common_voice_eo_26677019.mp3
323
+ common_voice_eo_26677023.mp3
324
+ common_voice_eo_20555291.mp3
325
+ common_voice_eo_25001942.mp3
326
+ common_voice_eo_25457354.mp3
327
+ common_voice_eo_25457355.mp3
328
+ common_voice_eo_25457365.mp3
329
+ common_voice_eo_25457373.mp3
330
+ common_voice_eo_25457396.mp3
331
+ common_voice_eo_25457397.mp3
332
+ common_voice_eo_25457409.mp3
333
+ common_voice_eo_25457410.mp3
334
+ common_voice_eo_25457412.mp3
335
+ common_voice_eo_25457442.mp3
336
+ common_voice_eo_25457444.mp3
337
+ common_voice_eo_25457445.mp3
338
+ common_voice_eo_25457577.mp3
339
+ common_voice_eo_25457578.mp3
340
+ common_voice_eo_28064453.mp3
341
+ common_voice_eo_25047803.mp3
342
+ common_voice_eo_25048418.mp3
343
+ common_voice_eo_25048419.mp3
344
+ common_voice_eo_25048421.mp3
345
+ common_voice_eo_25048423.mp3
346
+ common_voice_eo_25048428.mp3
347
+ common_voice_eo_25048574.mp3
348
+ common_voice_eo_25885643.mp3
349
+ common_voice_eo_25885645.mp3
350
+ common_voice_eo_26794882.mp3
351
+ common_voice_eo_27356529.mp3
352
+ common_voice_eo_25012640.mp3
353
+ common_voice_eo_25303457.mp3
354
+ common_voice_eo_18153931.mp3
355
+ common_voice_eo_18776206.mp3
356
+ common_voice_eo_18776208.mp3
357
+ common_voice_eo_18776219.mp3
358
+ common_voice_eo_18776220.mp3
359
+ common_voice_eo_18776222.mp3
360
+ common_voice_eo_18776223.mp3
361
+ common_voice_eo_18776236.mp3
362
+ common_voice_eo_18776238.mp3
363
+ common_voice_eo_18776244.mp3
364
+ common_voice_eo_18776248.mp3
365
+ common_voice_eo_18776285.mp3
366
+ common_voice_eo_18776287.mp3
367
+ common_voice_eo_18776297.mp3
368
+ common_voice_eo_18776298.mp3
369
+ common_voice_eo_25047998.mp3
370
+ common_voice_eo_25047999.mp3
371
+ common_voice_eo_25048000.mp3
372
+ common_voice_eo_25048001.mp3
373
+ common_voice_eo_25048002.mp3
374
+ common_voice_eo_25053113.mp3
375
+ common_voice_eo_25068355.mp3
376
+ common_voice_eo_25333056.mp3
377
+ common_voice_eo_25371639.mp3
378
+ common_voice_eo_25371640.mp3
379
+ common_voice_eo_25371641.mp3
380
+ common_voice_eo_25371642.mp3
381
+ common_voice_eo_25371643.mp3
382
+ common_voice_eo_22441946.mp3
383
+ common_voice_eo_26622121.mp3
384
+ common_voice_eo_25167318.mp3
385
+ common_voice_eo_25252685.mp3
386
+ common_voice_eo_25252698.mp3
387
+ common_voice_eo_25518636.mp3
388
+ ```
389
+
390
+ Note on `test[100]` and `test[101]`: We know that `saluton kiel vi fartas` and `atendu momenton` are a good start, but if that's not the text you were asked to record, you're not really helping.
391
+
392
+ #### Validation set
393
+
394
+ 141 of
395
+ ```txt
396
+ common_voice_eo_25392669.mp3
397
+ common_voice_eo_25392674.mp3
398
+ common_voice_eo_25392675.mp3
399
+ common_voice_eo_25392676.mp3
400
+ common_voice_eo_25392678.mp3
401
+ common_voice_eo_25392693.mp3
402
+ common_voice_eo_25392694.mp3
403
+ common_voice_eo_25392695.mp3
404
+ common_voice_eo_25392697.mp3
405
+ common_voice_eo_25392701.mp3
406
+ common_voice_eo_25392702.mp3
407
+ common_voice_eo_25392708.mp3
408
+ common_voice_eo_25392709.mp3
409
+ common_voice_eo_25408881.mp3
410
+ common_voice_eo_25408882.mp3
411
+ common_voice_eo_25408885.mp3
412
+ common_voice_eo_27380623.mp3
413
+ ```
414
+
415
+ I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
416
+
417
+
418
+ #### Training set
419
+
420
+ 135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
421
+
422
+ ```txt
423
+ common_voice_eo_25365027.mp3
424
+ common_voice_eo_25365472.mp3
425
+ common_voice_eo_25365480.mp3
426
+ common_voice_eo_25365532.mp3
427
+ common_voice_eo_25365695.mp3
428
+ common_voice_eo_25365744.mp3
429
+ common_voice_eo_25365804.mp3
430
+ common_voice_eo_25365836.mp3
431
+ common_voice_eo_25365855.mp3
432
+ common_voice_eo_25372587.mp3
433
+ common_voice_eo_25401060.mp3
434
+ common_voice_eo_25430837.mp3
435
+ common_voice_eo_25444509.mp3
436
+ common_voice_eo_25240777.mp3
437
+ common_voice_eo_24942754.mp3
438
+ common_voice_eo_24942755.mp3
439
+ common_voice_eo_24990372.mp3
440
+ common_voice_eo_24990385.mp3
441
+ common_voice_eo_24990390.mp3
442
+ common_voice_eo_24990397.mp3
443
+ common_voice_eo_24990413.mp3
444
+ common_voice_eo_24990427.mp3
445
+ common_voice_eo_24990429.mp3
446
+ common_voice_eo_24990435.mp3
447
+ common_voice_eo_24990441.mp3
448
+ common_voice_eo_24990454.mp3
449
+ common_voice_eo_24990457.mp3
450
+ common_voice_eo_24990459.mp3
451
+ common_voice_eo_24990490.mp3
452
+ common_voice_eo_25529345.mp3
453
+ common_voice_eo_25648750.mp3
454
+ common_voice_eo_28670472.mp3
455
+ common_voice_eo_27931966.mp3
456
+ common_voice_eo_28252265.mp3
457
+ common_voice_eo_25454951.mp3
458
+ common_voice_eo_25927616.mp3
459
+ common_voice_eo_25153203.mp3
460
+ common_voice_eo_25238543.mp3
461
+ common_voice_eo_25284237.mp3
462
+ common_voice_eo_25460131.mp3
463
+ common_voice_eo_25460185.mp3
464
+ common_voice_eo_25460186.mp3
465
+ common_voice_eo_25460188.mp3
466
+ common_voice_eo_25460189.mp3
467
+ common_voice_eo_25446723.mp3
468
+ common_voice_eo_26025150.mp3
469
+ common_voice_eo_26640189.mp3
470
+ common_voice_eo_26888468.mp3
471
+ common_voice_eo_24844824.mp3
472
+ common_voice_eo_25022506.mp3
473
+ common_voice_eo_25022507.mp3
474
+ common_voice_eo_25022516.mp3
475
+ common_voice_eo_25032858.mp3
476
+ common_voice_eo_25032859.mp3
477
+ common_voice_eo_25032865.mp3
478
+ common_voice_eo_25243988.mp3
479
+ common_voice_eo_25244009.mp3
480
+ common_voice_eo_25266094.mp3
481
+ common_voice_eo_25266141.mp3
482
+ common_voice_eo_25285278.mp3
483
+ common_voice_eo_25286768.mp3
484
+ common_voice_eo_25457171.mp3
485
+ common_voice_eo_25467641.mp3
486
+ common_voice_eo_25467723.mp3
487
+ common_voice_eo_25467791.mp3
488
+ common_voice_eo_25467820.mp3
489
+ common_voice_eo_25467943.mp3
490
+ common_voice_eo_25478612.mp3
491
+ common_voice_eo_25478623.mp3
492
+ common_voice_eo_25478631.mp3
493
+ common_voice_eo_25478756.mp3
494
+ common_voice_eo_25478762.mp3
495
+ common_voice_eo_25478768.mp3
496
+ common_voice_eo_25478769.mp3
497
+ common_voice_eo_25479150.mp3
498
+ common_voice_eo_25479203.mp3
499
+ common_voice_eo_25479229.mp3
500
+ common_voice_eo_25517673.mp3
501
+ common_voice_eo_25517677.mp3
502
+ common_voice_eo_25527739.mp3
503
+ common_voice_eo_25975149.mp3
504
+ common_voice_eo_26193748.mp3
505
+ common_voice_eo_28401039.mp3
506
+ common_voice_eo_28421315.mp3
507
+ common_voice_eo_28937347.mp3
508
+ common_voice_eo_24890414.mp3
509
+ common_voice_eo_25294479.mp3
510
+ common_voice_eo_25438966.mp3
511
+ common_voice_eo_28855568.mp3
512
+ common_voice_eo_29011007.mp3
513
+ common_voice_eo_24599888.mp3
514
+ common_voice_eo_26964252.mp3
515
+ common_voice_eo_26964496.mp3
516
+ common_voice_eo_26964510.mp3
517
+ common_voice_eo_25432789.mp3
518
+ common_voice_eo_26688158.mp3
519
+ common_voice_eo_28516354.mp3
520
+ common_voice_eo_24790865.mp3
521
+ common_voice_eo_24790897.mp3
522
+ common_voice_eo_24790898.mp3
523
+ common_voice_eo_24790899.mp3
524
+ common_voice_eo_24790900.mp3
525
+ common_voice_eo_25362713.mp3
526
+ common_voice_eo_27585084.mp3
527
+ common_voice_eo_24813131.mp3
528
+ common_voice_eo_25035262.mp3
529
+ common_voice_eo_26000289.mp3
530
+ common_voice_eo_26003943.mp3
531
+ common_voice_eo_26283983.mp3
532
+ common_voice_eo_28708931.mp3
533
+ common_voice_eo_28037217.mp3
534
+ common_voice_eo_29273106.mp3
535
+ common_voice_eo_26006657.mp3
536
+ common_voice_eo_25399924.mp3
537
+ common_voice_eo_27982431.mp3
538
+ common_voice_eo_25893779.mp3
539
+ common_voice_eo_27842061.mp3
540
+ common_voice_eo_25052385.mp3
541
+ common_voice_eo_25807395.mp3
542
+ common_voice_eo_25807985.mp3
543
+ common_voice_eo_25808039.mp3
544
+ common_voice_eo_25808407.mp3
545
+ common_voice_eo_25809036.mp3
546
+ common_voice_eo_27487795.mp3
547
+ common_voice_eo_28460556.mp3
548
+ common_voice_eo_28884851.mp3
549
+ common_voice_eo_24819719.mp3
550
+ common_voice_eo_25153594.mp3
551
+ common_voice_eo_25234585.mp3
552
+ common_voice_eo_25245164.mp3
553
+ common_voice_eo_27538877.mp3
554
+ common_voice_eo_24862771.mp3
555
+ common_voice_eo_25070167.mp3
556
+ common_voice_eo_26381720.mp3
557
+ common_voice_eo_28110376.mp3
558
+ ```
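The CER-based filtering described under Alternative 3 can be sketched in plain Python. This is a stand-in for whatever CER metric the evaluation script uses, not a copy of it: `edit_distance`, `cer`, and `filter_by_cer` are hypothetical helpers, CER is taken to be the character-level Levenshtein distance divided by the reference length, and 0.5 is the cut-off suggested above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0  # convention chosen for empty references
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(items, max_cer: float = 0.5):
    """Keep the names of (name, reference, prediction) triples with CER <= max_cer."""
    return [name for name, ref, hyp in items if cer(ref, hyp) <= max_cer]


# Row 7333 from the table above: 32 edits over 34 reference characters, about 0.94.
row_cer = cer("poste sekvas difinoj de la termino", "po")
```

As noted for the validation and training sets, a high CER does not always mean a bad recording, so the output of such a filter would still need a manual listen before anything is dropped.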
559
+
560
+ ### Alternative 3.1
561
+
562
+ For the files that have no audio or distorted audio, maybe change their target transcription to an empty string? Except for 'injabum'.
563
+
564
+ ### And also
565
+
566
+ Since anyone can sign up at Common Voice to review Esperanto audio files, I've done so in the hope of making a small contribution to the dataset's quality.