Fine-tuned `wav2vec2-base` on 100h of Librispeech training data. Results on "clean" data are very similar to those of the official model. However, the result on "other" is significantly worse; the model seems to have overfitted to the "clean" data.

The model was trained on *librispeech-clean-train.100* with the following hyper-parameters:

- 2 Titan RTX GPUs
- Total update steps: 13000
- Batch size per GPU: 32, corresponding to a *total batch size* of approx. 1500 seconds
- Adam with a linearly decaying learning rate and 3000 warmup steps
- Dynamic grouping for batching
- fp16
- attention_mask was **not** used during training

*Result (WER)* on Librispeech test:

| "clean" | "other" |
|---|---|
| 6.5 | 18.7 |
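
The learning-rate schedule named in the hyper-parameters (3000 warmup steps, then linear decay over 13000 total update steps) can be sketched roughly as follows. This is only an illustration, not the actual training code: the function name `lr_multiplier` is hypothetical, the warmup is assumed to be a linear ramp, the decay is assumed to reach zero at the final step, and the peak learning rate is a placeholder because it is not stated above.

```python
def lr_multiplier(step: int, warmup_steps: int = 3000, total_steps: int = 13000) -> float:
    """Scale factor applied to the peak learning rate at a given update step."""
    if step < warmup_steps:
        # Assumption: linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Assumption: linear decay from 1 (end of warmup) to 0 (final step).
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    peak_lr = 1e-4  # hypothetical placeholder; the real peak value is not given
    for step in (0, 1500, 3000, 8000, 13000):
        print(step, peak_lr * lr_multiplier(step))
```

Under these assumptions the multiplier rises to 1.0 at step 3000, falls back to 0.5 at step 8000, and reaches 0.0 at step 13000.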