wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the mozilla-foundation/common_voice_13_0 Esperanto dataset. It achieves the following results on the evaluation set:

Loss: 0.2191
Cer: 0.0208
Wer: 0.0687

The first 10 samples in the test set:

Actual Predicted	CER
`la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo` `la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`	0.0
`en la sekva jaro li ricevis premion` `en la sekva jaro li ricevis prenion`	0.02857142857142857
`ŝi studis historion ĉe la universitato de brita kolumbio` `ŝi studis historion ĉe la universitato de brita kolumbio`	0.0
`larĝaj ŝtupoj kuras al la fasado` `larĝaj ŝtupoj kuras al la fasado`	0.0
`la municipo ĝuas duan epokon de etendo kaj disvolviĝo` `la municipo ĝuas duonepokon de tendo kaj disvolviĝo`	0.05660377358490566
`li estis ankaŭ katedrestro kaj dekano` `li estis ankaŭ katedresto kaj dekano`	0.02702702702702703
`librovendejo apartenas al la muzeo` `librovendejo apartenas al la muzeo`	0.0
`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj` `ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj`	0.02702702702702703
`unue ili estas ruĝaj poste brunaj` `unue ili estas ruĝaj poste brunaj`	0.0
`la loĝantaro laboras en la proksima ĉefurbo` `la loĝantaro laboras en la proksima ĉefurbo`	0.0
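
For reference, the CER values in the table are consistent with character-level edit distance divided by the length of the actual text. The following is a minimal pure-Python sketch of that calculation, not the metric implementation used during training; the function names are my own.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, prediction) / len(reference)


# Second row of the table: one substitution (m -> n) over 35 characters, i.e. 1/35.
print(cer("en la sekva jaro li ricevis premion",
          "en la sekva jaro li ricevis prenion"))
```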

Model description

See facebook/wav2vec2-large-xlsr-53.

Intended uses & limitations

Speech recognition for Esperanto. The base model was pretrained and fine-tuned on speech audio sampled at 16 kHz, so make sure that your speech input is also sampled at 16 kHz.
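
A quick way to check the sample rate before transcribing is sketched below. It assumes the input is an uncompressed WAV file and uses only Python's standard `wave` module; the helper names are hypothetical, and compressed formats such as MP3 would need an audio library instead.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the rate the model expects


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected: int = EXPECTED_RATE) -> bool:
    """True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected
```

If `needs_resampling` returns true, resample the audio to 16 kHz with the audio tool of your choice before passing it to the model.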

Training and evaluation data

The training split was set to train[:15000] while the eval split was set to validation[:1500].

Training procedure

I used run_speech_recognition_ctc.py with the following train.json file passed to it:

{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫？",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}

I went through the dataset to find non-speech characters and placed them in chars_to_ignore. In addition, some character sequences could be transcribed to Esperanto phonemes, so I placed those as a dictionary in chars_to_substitute. This required adding a new argument to the program:

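from dataclasses import dataclass, field
from typing import Dict, Optional

def dict_field(default=None, metadata=None):
    # Wrap the default in a factory, as dataclass fields require for dict values.
    return field(default_factory=lambda: default, metadata=metadata)

@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )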

Then I added substitute_characters, modeled on remove_special_characters, to do the actual substitution:

    def remove_special_characters(batch):
        text = batch[text_column_name]
        if chars_to_ignore_regex is not None:
            text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
        batch["target_text"] = text.lower() + " "
        return batch

    def substitute_characters(batch):
        text: str = batch["target_text"]
        if data_args.chars_to_substitute is not None:
            for k, v in data_args.chars_to_substitute.items():
                # str.replace returns a new string, so assign the result back
                text = text.replace(k, v)
        batch["target_text"] = text.lower()
        return batch

    with training_args.main_process_first(desc="dataset map special characters removal"):
        raw_datasets = raw_datasets.map(
            remove_special_characters,
            remove_columns=[text_column_name],
            desc="remove special characters from datasets",
        )

    with training_args.main_process_first(desc="dataset map special characters substitute"):
        raw_datasets = raw_datasets.map(
            substitute_characters,
            desc="substitute special characters in datasets",
        )
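
Because the substitutions are applied one after another in dictionary order, the order of the entries matters: "przy" has to be handled before "y", or the "y" rule would consume part of it first. The sketch below illustrates this with a subset of the mapping from train.json; the `substitute` helper is a hypothetical stand-alone version, not the function from the training script.

```python
# Subset of the chars_to_substitute mapping from train.json, in the same order.
CHARS_TO_SUBSTITUTE = {
    "przy": "pŝe",
    "cx": "ĉ",
    "sx": "ŝ",
    "y": "j",
    "qu": "ku",
}


def substitute(text: str, mapping: dict) -> str:
    """Lowercase the text, then apply each replacement in dict insertion order."""
    text = text.lower()
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


print(substitute("Sxi vizitis New York", CHARS_TO_SUBSTITUTE))  # ŝi vizitis new jork

# With the order reversed, "y" -> "j" runs first and "przy" never matches.
reversed_mapping = dict(reversed(list(CHARS_TO_SUBSTITUTE.items())))
print(substitute("przy", CHARS_TO_SUBSTITUTE))  # pŝe
print(substitute("przy", reversed_mapping))     # przj
```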

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
layerdrop: 0.1
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 100
mixed_precision_training: Native AMP
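
The total batch size and the epoch numbers reported during training follow from these settings. The arithmetic below is a sketch that assumes a single GPU and that all 15000 training examples survived preprocessing; neither assumption is stated in the logs.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1          # assumption: single GPU
train_examples = 15_000  # train[:15000]; assumes no examples were filtered out

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = train_examples / total_train_batch_size
epoch_at_step_1000 = 1000 / steps_per_epoch

print(total_train_batch_size)        # 32
print(steps_per_epoch)               # 468.75
print(round(epoch_at_step_1000, 2))  # 2.13
```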

Training results

Training Loss	Epoch	Step	Cer	Validation Loss	Wer
2.6416	2.13	1000	0.1541	0.8599	0.6449
0.2633	4.27	2000	0.0335	0.1897	0.1431
0.1739	6.4	3000	0.0289	0.1732	0.1145
0.1378	8.53	4000	0.0276	0.1729	0.1066
0.1172	10.67	5000	0.0268	0.1773	0.1019
0.1049	12.8	6000	0.0255	0.1701	0.0937
0.0951	14.93	7000	0.0253	0.1718	0.0933
0.0851	17.07	8000	0.0239	0.1787	0.0834
0.0809	19.2	9000	0.0235	0.1802	0.0835
0.0756	21.33	10000	0.0239	0.1784	0.0855
0.0708	23.47	11000	0.0235	0.1748	0.0824
0.0657	25.6	12000	0.0228	0.1830	0.0796
0.0605	27.73	13000	0.0230	0.1896	0.0798
0.0583	29.87	14000	0.0224	0.1889	0.0778
0.0608	32.0	15000	0.0223	0.1849	0.0757
0.0556	34.13	16000	0.0223	0.1872	0.0767
0.0534	36.27	17000	0.0221	0.1893	0.0751
0.0523	38.4	18000	0.0218	0.1925	0.0729
0.0494	40.53	19000	0.0221	0.1957	0.0745
0.0475	42.67	20000	0.0217	0.1961	0.0740
0.048	44.8	21000	0.0214	0.1957	0.0714
0.0459	46.93	22000	0.0215	0.1968	0.0717
0.0435	49.07	23000	0.0217	0.2008	0.0717
0.0428	51.2	24000	0.0212	0.1991	0.0696
0.0418	53.33	25000	0.0215	0.2034	0.0714
0.0404	55.47	26000	0.0210	0.2014	0.0684
0.0394	57.6	27000	0.0210	0.2050	0.0681
0.0399	59.73	28000	0.0211	0.2039	0.0700
0.0389	61.87	29000	0.0214	0.2091	0.0694
0.038	64.0	30000	0.0210	0.2100	0.0702
0.0361	66.13	31000	0.0215	0.2119	0.0703
0.0359	68.27	32000	0.0213	0.2108	0.0714
0.0354	70.4	33000	0.0211	0.2120	0.0699
0.0364	72.53	34000	0.0211	0.2128	0.0688
0.0361	74.67	35000	0.0212	0.2134	0.0694
0.0332	76.8	36000	0.0210	0.2176	0.0698
0.0341	78.93	37000	0.0208	0.2170	0.0688
0.032	81.07	38000	0.0209	0.2157	0.0686
0.0318	83.33	39000	0.0209	0.2166	0.0685
0.0325	85.47	40000	0.0209	0.2172	0.0687
0.0316	87.6	41000	0.0208	0.2181	0.0678
0.0302	89.73	42000	0.0208	0.2171	0.0679
0.0318	91.87	43000	0.0211	0.2179	0.0702
0.0314	94.0	44000	0.0208	0.2186	0.0690
0.0309	96.13	45000	0.0210	0.2193	0.0696
0.031	98.27	46000	0.0208	0.2191	0.0686

Framework versions

Transformers 4.29.1
Pytorch 2.0.1+cu118
Datasets 2.12.0
Tokenizers 0.13.3

Discussion

Nans and Infs

While debugging other training sessions that used more data from the Esperanto Common Voice dataset, in which some loss calculations returned either inf or nan, I found that this model produced surprisingly high CER on some examples from the training set. Some examples:

file	Actual --- Predicted	CER	Comment
common_voice_eo_25365027.mp3	en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj --- a taaj keo eoj eejn kigos eegoj eioeegiooj	0.61	No audio
common_voice_eo_25365472.mp3	ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon --- ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon	0.55	Barely any audio, distorted
common_voice_eo_25365836.mp3	industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon --- iiti sieetas la eeadooddddooiooaotooeioj aiicenon	0.67	Barely any audio, distorted
2600	ili akiras plenkreskan plumaron nur en la kvina jaro --- ili aaros peetaj patato a a sia ro	0.52	It's literally someone saying 'injabum'. Thanks, troll.
7333	poste sekvas difinoj de la termino --- po	0.94	No audio
7334	li gvidis multajn kursojn laŭ la csehmetodo --- po	0.98	No audio
7429	tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete --- po	0.97	No audio
11662	lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj --- linkonteto estastitot etateerteito en pootaeaje lgijoj	0.58	No audio

Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.

You can see that the model is trying to hallucinate the target when there's little or no audio. This is terrible for realistically reporting what was said. I'd also hope that there is some measure of certainty, and maybe only go with transcriptions that have relatively high certainty. However, I can't find how to get at a certainty value.
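
One rough heuristic, which I have not validated on this model, is to derive a score from the CTC logits: take the softmax of each frame, keep the highest probability, and average over the frames that were not decoded as the blank token. The sketch below works on plain nested lists of logits; in practice they would come from the model's output logits, and `blank_id` is an assumption that depends on the tokenizer. The result is not a calibrated probability.

```python
import math


def frame_confidences(logits):
    """For each frame, return (argmax index, softmax probability of that index)."""
    result = []
    for frame in logits:
        peak = max(frame)
        exps = [math.exp(x - peak) for x in frame]  # subtract max for stability
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((best, probs[best]))
    return result


def utterance_confidence(logits, blank_id=0):
    """Mean top probability over non-blank frames; 0.0 if every frame is blank."""
    scored = [p for idx, p in frame_confidences(logits) if idx != blank_id]
    if not scored:
        return 0.0
    return sum(scored) / len(scored)
```

Transcriptions whose score falls below some cutoff could then be flagged for review instead of being reported as-is. The cutoff would have to be tuned on real data.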

The Common Voice dataset also contains upvotes and downvotes. Of the high CER sentences above, all had 2 upvotes, with some having 0 downvotes, and some having 1. So we cannot rely on upvotes or downvotes to detect quality.

So what to do?

Alternative 1

Despite these zero- and low-quality files, training seems to work OK. However, we still need to address when loss becomes nan or inf because that ruins the calculation.

By running run_speech_recognition_ctc with do_train=false, setting model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3", setting eval_split_name to either test, validation, or train, and also modifying trainer.py as follows, I can check if any losses are nan or inf:

        # To be JSON-serializable, we need to remove numpy types or zero-d tensors
        metrics = denumpify_detensorize(metrics)

        if all_losses is not None:
            # np.where returns a tuple of index arrays, so check the first array's size
            loss_nan = np.where(np.isnan(all_losses))[0]
            if loss_nan.size != 0:
                print(f'LOSSES ARE NAN: {loss_nan}')
            loss_inf = np.where(np.isinf(all_losses))[0]
            if loss_inf.size != 0:
                print(f'LOSSES ARE INF: {loss_inf}')
            metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()

Doing this shows that of the 14913 examples in test, the following example results in inf loss:

common_voice_eo_25167318.mp3

The audio on this is severely garbled. This should absolutely be filtered out of the test set.

No validation samples result in inf or nan.

The following 18 out of 143984 examples in train result in inf loss:

common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3

Those files have no audio.
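
The same check can be sketched outside trainer.py. The version below assumes there is one loss value per example, in the same order as the file paths (for instance, from evaluating with a batch size of 1); both helper names are hypothetical.

```python
import math


def find_nonfinite(losses):
    """Return (indices of nan losses, indices of inf losses)."""
    nan_idx = [i for i, x in enumerate(losses) if math.isnan(x)]
    inf_idx = [i for i, x in enumerate(losses) if math.isinf(x)]
    return nan_idx, inf_idx


def flagged_files(losses, paths):
    """Paths of the examples whose loss is nan or inf, in dataset order."""
    nan_idx, inf_idx = find_nonfinite(losses)
    return [paths[i] for i in sorted(set(nan_idx) | set(inf_idx))]
```

The returned paths can then be excluded before training. Separately, the Wav2Vec2 configuration has a `ctc_zero_infinity` option that zeroes out infinite CTC losses; I have not tested whether it helps here.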

Alternative 2

Another possibility is just to go through the audio files and throw away any where the peak audio isn't above some threshold.
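
A minimal sketch of that idea, assuming the audio has already been decoded to floating-point samples in the range -1 to 1. The threshold of 0.01 is a placeholder I have not tuned, and the function names are hypothetical.

```python
def peak_amplitude(samples):
    """Largest absolute sample value; 0.0 for an empty clip."""
    return max((abs(s) for s in samples), default=0.0)


def is_too_quiet(samples, threshold=0.01):
    """True if the clip's peak never reaches the threshold."""
    return peak_amplitude(samples) < threshold
```

A peak check would catch silent files but not the distorted ones, and a single loud click could let an otherwise empty clip through.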

Alternative 3

Since this model seems to work well enough, I could run inference on all samples, and just discard the ones where the CER (as determined by this model) is too high, say above 0.5. Then use that to filter the examples and train another model. These high-CER examples are:

Test set

71 of 14913 examples in the test set show high CER.

common_voice_eo_25214319.mp3
common_voice_eo_25006596.mp3
common_voice_eo_27472721.mp3
common_voice_eo_27715088.mp3
common_voice_eo_27715091.mp3
common_voice_eo_26677019.mp3
common_voice_eo_26677023.mp3
common_voice_eo_20555291.mp3
common_voice_eo_25001942.mp3
common_voice_eo_25457354.mp3
common_voice_eo_25457355.mp3
common_voice_eo_25457365.mp3
common_voice_eo_25457373.mp3
common_voice_eo_25457396.mp3
common_voice_eo_25457397.mp3
common_voice_eo_25457409.mp3
common_voice_eo_25457410.mp3
common_voice_eo_25457412.mp3
common_voice_eo_25457442.mp3
common_voice_eo_25457444.mp3
common_voice_eo_25457445.mp3
common_voice_eo_25457577.mp3
common_voice_eo_25457578.mp3
common_voice_eo_28064453.mp3
common_voice_eo_25047803.mp3
common_voice_eo_25048418.mp3
common_voice_eo_25048419.mp3
common_voice_eo_25048421.mp3
common_voice_eo_25048423.mp3
common_voice_eo_25048428.mp3
common_voice_eo_25048574.mp3
common_voice_eo_25885643.mp3
common_voice_eo_25885645.mp3
common_voice_eo_26794882.mp3
common_voice_eo_27356529.mp3
common_voice_eo_25012640.mp3
common_voice_eo_25303457.mp3
common_voice_eo_18153931.mp3
common_voice_eo_18776206.mp3
common_voice_eo_18776208.mp3
common_voice_eo_18776219.mp3
common_voice_eo_18776220.mp3
common_voice_eo_18776222.mp3
common_voice_eo_18776223.mp3
common_voice_eo_18776236.mp3
common_voice_eo_18776238.mp3
common_voice_eo_18776244.mp3
common_voice_eo_18776248.mp3
common_voice_eo_18776285.mp3
common_voice_eo_18776287.mp3
common_voice_eo_18776297.mp3
common_voice_eo_18776298.mp3
common_voice_eo_25047998.mp3
common_voice_eo_25047999.mp3
common_voice_eo_25048000.mp3
common_voice_eo_25048001.mp3
common_voice_eo_25048002.mp3
common_voice_eo_25053113.mp3
common_voice_eo_25068355.mp3
common_voice_eo_25333056.mp3
common_voice_eo_25371639.mp3
common_voice_eo_25371640.mp3
common_voice_eo_25371641.mp3
common_voice_eo_25371642.mp3
common_voice_eo_25371643.mp3
common_voice_eo_22441946.mp3
common_voice_eo_26622121.mp3
common_voice_eo_25167318.mp3
common_voice_eo_25252685.mp3
common_voice_eo_25252698.mp3
common_voice_eo_25518636.mp3

Note on two of the examples: We know that saluton kiel vi fartas ("Hello, how are you") and atendu momenton ("Wait a moment") are a good start in learning Esperanto, but if that's not the text to record, you're not really helping.

Validation set

17 of 14909 examples in the validation set show high CER.

common_voice_eo_25392669.mp3
common_voice_eo_25392674.mp3
common_voice_eo_25392675.mp3
common_voice_eo_25392676.mp3
common_voice_eo_25392678.mp3
common_voice_eo_25392693.mp3
common_voice_eo_25392694.mp3
common_voice_eo_25392695.mp3
common_voice_eo_25392697.mp3
common_voice_eo_25392701.mp3
common_voice_eo_25392702.mp3
common_voice_eo_25392708.mp3
common_voice_eo_25392709.mp3
common_voice_eo_25408881.mp3
common_voice_eo_25408882.mp3
common_voice_eo_25408885.mp3
common_voice_eo_27380623.mp3

I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.

Training set

135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.

common_voice_eo_25365027.mp3
common_voice_eo_25365472.mp3
common_voice_eo_25365480.mp3
common_voice_eo_25365532.mp3
common_voice_eo_25365695.mp3
common_voice_eo_25365744.mp3
common_voice_eo_25365804.mp3
common_voice_eo_25365836.mp3
common_voice_eo_25365855.mp3
common_voice_eo_25372587.mp3
common_voice_eo_25401060.mp3
common_voice_eo_25430837.mp3
common_voice_eo_25444509.mp3
common_voice_eo_25240777.mp3
common_voice_eo_24942754.mp3
common_voice_eo_24942755.mp3
common_voice_eo_24990372.mp3
common_voice_eo_24990385.mp3
common_voice_eo_24990390.mp3
common_voice_eo_24990397.mp3
common_voice_eo_24990413.mp3
common_voice_eo_24990427.mp3
common_voice_eo_24990429.mp3
common_voice_eo_24990435.mp3
common_voice_eo_24990441.mp3
common_voice_eo_24990454.mp3
common_voice_eo_24990457.mp3
common_voice_eo_24990459.mp3
common_voice_eo_24990490.mp3
common_voice_eo_25529345.mp3
common_voice_eo_25648750.mp3
common_voice_eo_28670472.mp3
common_voice_eo_27931966.mp3
common_voice_eo_28252265.mp3
common_voice_eo_25454951.mp3
common_voice_eo_25927616.mp3
common_voice_eo_25153203.mp3
common_voice_eo_25238543.mp3
common_voice_eo_25284237.mp3
common_voice_eo_25460131.mp3
common_voice_eo_25460185.mp3
common_voice_eo_25460186.mp3
common_voice_eo_25460188.mp3
common_voice_eo_25460189.mp3
common_voice_eo_25446723.mp3
common_voice_eo_26025150.mp3
common_voice_eo_26640189.mp3
common_voice_eo_26888468.mp3
common_voice_eo_24844824.mp3
common_voice_eo_25022506.mp3
common_voice_eo_25022507.mp3
common_voice_eo_25022516.mp3
common_voice_eo_25032858.mp3
common_voice_eo_25032859.mp3
common_voice_eo_25032865.mp3
common_voice_eo_25243988.mp3
common_voice_eo_25244009.mp3
common_voice_eo_25266094.mp3
common_voice_eo_25266141.mp3
common_voice_eo_25285278.mp3
common_voice_eo_25286768.mp3
common_voice_eo_25457171.mp3
common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3
common_voice_eo_25975149.mp3
common_voice_eo_26193748.mp3
common_voice_eo_28401039.mp3
common_voice_eo_28421315.mp3
common_voice_eo_28937347.mp3
common_voice_eo_24890414.mp3
common_voice_eo_25294479.mp3
common_voice_eo_25438966.mp3
common_voice_eo_28855568.mp3
common_voice_eo_29011007.mp3
common_voice_eo_24599888.mp3
common_voice_eo_26964252.mp3
common_voice_eo_26964496.mp3
common_voice_eo_26964510.mp3
common_voice_eo_25432789.mp3
common_voice_eo_26688158.mp3
common_voice_eo_28516354.mp3
common_voice_eo_24790865.mp3
common_voice_eo_24790897.mp3
common_voice_eo_24790898.mp3
common_voice_eo_24790899.mp3
common_voice_eo_24790900.mp3
common_voice_eo_25362713.mp3
common_voice_eo_27585084.mp3
common_voice_eo_24813131.mp3
common_voice_eo_25035262.mp3
common_voice_eo_26000289.mp3
common_voice_eo_26003943.mp3
common_voice_eo_26283983.mp3
common_voice_eo_28708931.mp3
common_voice_eo_28037217.mp3
common_voice_eo_29273106.mp3
common_voice_eo_26006657.mp3
common_voice_eo_25399924.mp3
common_voice_eo_27982431.mp3
common_voice_eo_25893779.mp3
common_voice_eo_27842061.mp3
common_voice_eo_25052385.mp3
common_voice_eo_25807395.mp3
common_voice_eo_25807985.mp3
common_voice_eo_25808039.mp3
common_voice_eo_25808407.mp3
common_voice_eo_25809036.mp3
common_voice_eo_27487795.mp3
common_voice_eo_28460556.mp3
common_voice_eo_28884851.mp3
common_voice_eo_24819719.mp3
common_voice_eo_25153594.mp3
common_voice_eo_25234585.mp3
common_voice_eo_25245164.mp3
common_voice_eo_27538877.mp3
common_voice_eo_24862771.mp3
common_voice_eo_25070167.mp3
common_voice_eo_26381720.mp3
common_voice_eo_28110376.mp3
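
The filtering step in Alternative 3 can be sketched as a simple split on the per-example CER. The record layout (a dict with "path" and "cer" keys) is an assumption for illustration, and the last record is a made-up placeholder; the other three CER values are taken from the table in the Nans and Infs section.

```python
CER_THRESHOLD = 0.5  # the cutoff suggested for Alternative 3


def split_by_cer(records, threshold=CER_THRESHOLD):
    """Split records into (keep, discard) by comparing each CER to the threshold."""
    keep, discard = [], []
    for rec in records:
        (discard if rec["cer"] > threshold else keep).append(rec)
    return keep, discard


records = [
    {"path": "common_voice_eo_25365027.mp3", "cer": 0.61},
    {"path": "common_voice_eo_25365472.mp3", "cer": 0.55},
    {"path": "common_voice_eo_25365836.mp3", "cer": 0.67},
    {"path": "hypothetical_clean_example.mp3", "cer": 0.02},  # placeholder record
]
keep, discard = split_by_cer(records)
print([rec["path"] for rec in discard])
```

As with the lists above, the discarded files are still worth a manual listen, since some high-CER recordings turn out to be fine.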

Alternative 3.1

Of those files that have no or distorted audio, maybe change their target to be empty? Except for 'injabum'.

And also

Since one can sign up at Common Voice to review Esperanto audio files, I've done so in the hopes of making a small contribution in quality.
