Second fine-tuning attempt of wav2vec2-base. The results are similar to those reported in https://huggingface.co/facebook/wav2vec2-base-100h.

The model was trained on librispeech-clean-train.100 with the following hyper-parameters:

  • 2 Titan RTX GPUs
  • Total update steps: 11000
  • Batch size per GPU: 32, corresponding to a total batch size of ca. 750 seconds
  • Adam with a linearly decaying learning rate and 3000 warmup steps
  • Dynamic padding per batch
  • fp16
  • attention_mask was not used during training
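
The schedule and batch-size figures above can be sketched as follows. This is a minimal sketch, not the training code: it assumes the warmup is linear and that the learning rate decays to zero at the final update step, neither of which is stated above. The peak learning rate is not given here, so the hypothetical helper lr_multiplier returns only a factor between 0 and 1 that would scale it.

```python
# Values taken from the hyper-parameter list above.
WARMUP_STEPS = 3000
TOTAL_STEPS = 11000
NUM_GPUS = 2
BATCH_PER_GPU = 32
TOTAL_AUDIO_SECONDS = 750  # approximate total batch size in seconds


def lr_multiplier(step, warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Factor in [0, 1] that scales the (unspecified) peak learning rate.

    Assumption: linear warmup from 0 to 1, then linear decay to 0 at total_steps.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Number of utterances consumed per update step across both GPUs.
utterances_per_step = NUM_GPUS * BATCH_PER_GPU

# Implied average utterance length, derived from the approximate 750 seconds.
avg_seconds_per_utterance = TOTAL_AUDIO_SECONDS / utterances_per_step

if __name__ == "__main__":
    for step in (0, 1500, 3000, 7000, 11000):
        print(step, lr_multiplier(step))
    print(utterances_per_step, round(avg_seconds_per_utterance, 1))
```

Under these assumptions the factor peaks at step 3000 and has fallen back to half by step 7000; each update sees 64 utterances averaging a little under 12 seconds.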

Training logs: https://wandb.ai/patrickvonplaten/huggingface/runs/1yrpescx?workspace=user-patrickvonplaten

Results (WER) on LibriSpeech:

"clean" (% rel. difference to results in paper) | "other" (% rel. difference to results in paper)
6.2 (-1.6%)                                     | 15.2 (-11.2%)
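
The convention behind the relative differences is not stated. The sketch below assumes they are computed as (paper WER - this run's WER) / this run's WER, so a negative value means this run has the higher WER. Both helper names are hypothetical, and the paper values are back-calculated from the percentages above, not quoted from the paper.

```python
def relative_difference(paper_wer, run_wer):
    """Relative difference in percent, assuming (paper - run) / run."""
    return 100.0 * (paper_wer - run_wer) / run_wer


def implied_paper_wer(run_wer, rel_diff_percent):
    """Invert the assumed convention to recover the paper WER it implies."""
    return run_wer * (1.0 + rel_diff_percent / 100.0)


# WER and relative difference as reported in the results above.
reported = {"clean": (6.2, -1.6), "other": (15.2, -11.2)}

if __name__ == "__main__":
    for split, (run_wer, rel) in reported.items():
        paper = round(implied_paper_wer(run_wer, rel), 1)
        # Recompute the percentage from the rounded value as a consistency check.
        check = round(relative_difference(paper, run_wer), 1)
        print(split, paper, check)
```

If the assumed convention holds, the rounded back-calculated values reproduce the reported percentages; a different denominator would imply slightly different paper numbers.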
