

This is an encoder-decoder model for automatic speech recognition trained on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from jonatasgrosman/wav2vec2-large-xlsr-53-german and the decoder from dbmdz/german-gpt2.

It was trained using a two-step process:

  • fine-tuning only the cross-attention weights and the decoder, using the pre-computed outputs of the Wav2Vec model
    • relatively fast training
    • also works on a small GPU (e.g. 8 GB)
    • but may take a lot of disk space
    • should already yield decent results
  • fine-tuning the model end-to-end
    • much slower
    • needs a bigger GPU
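
The two steps above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the training code of this model: the linear `encoder`, the one-layer `decoder`, the random tensors, and the placeholder loss are all assumptions standing in for the Wav2Vec encoder, the GPT2 decoder with cross-attention, and the real data and objective.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a linear "encoder" and a one-layer Transformer
# decoder, whose layer contains self-attention, cross-attention and a feed-forward block.
encoder = nn.Linear(16, 32)
decoder_layer = nn.TransformerDecoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

# Step 1a: freeze the encoder and pre-compute its outputs once.
# In practice these would be written to disk, which is where the disk usage comes from.
encoder.requires_grad_(False)
encoder.eval()
audio_features = torch.randn(4, 10, 16)  # (batch, frames, features), random placeholder
with torch.no_grad():
    cached_encoder_outputs = encoder(audio_features)

# Step 1b: train only the decoder side on the cached encoder outputs.
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
target_embeddings = torch.randn(4, 6, 32)  # placeholder for embedded target tokens
decoder_output = decoder(tgt=target_embeddings, memory=cached_encoder_outputs)
loss = decoder_output.pow(2).mean()  # placeholder loss, not a real ASR objective
loss.backward()
optimizer.step()

# Step 2: unfreeze the encoder so the whole model can be fine-tuned end-to-end.
encoder.requires_grad_(True)
encoder.train()
```

Because the encoder never runs during the first step's training loop, only the decoder's activations and gradients need to fit in GPU memory, which is consistent with the first step being faster and lighter than the second.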

There is also one trick that seemed to improve performance significantly: adding position embeddings to the encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (see eval.py).
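
A minimal sketch of this idea in PyTorch is shown below. The class name `EncoderOutputPositions` is hypothetical and this is not the code from eval.py. It assumes the encoder outputs already have the same hidden size as the position embeddings and that the sequence is no longer than the number of available positions; the random `pretrained_weights` tensor is a placeholder for the GPT2 position embedding matrix.

```python
import torch
from torch import nn


class EncoderOutputPositions(nn.Module):
    """Adds learned position embeddings to encoder outputs (hypothetical helper)."""

    def __init__(self, pretrained_position_weights: torch.Tensor):
        super().__init__()
        # Initialize from the given weights; freeze=False keeps them trainable.
        self.position_embeddings = nn.Embedding.from_pretrained(
            pretrained_position_weights.clone(), freeze=False
        )

    def forward(self, encoder_outputs: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, sequence_length, hidden_size)
        seq_len = encoder_outputs.size(1)
        positions = torch.arange(seq_len, device=encoder_outputs.device)
        # (sequence_length, hidden_size) broadcasts over the batch dimension.
        return encoder_outputs + self.position_embeddings(positions)


torch.manual_seed(0)
pretrained_weights = torch.randn(50, 8)  # placeholder: (max_positions, hidden_size)
add_positions = EncoderOutputPositions(pretrained_weights)

encoder_outputs = torch.randn(2, 5, 8)  # placeholder encoder outputs
outputs_with_positions = add_positions(encoder_outputs)
```

The result has the same shape as the encoder outputs, so it can be passed to the decoder's cross-attention in their place.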

The training notebooks are still early drafts. The results can probably be improved a lot, for example by using a learning rate schedule.

Dataset used to train jsnfly/wav2vec2-large-xlsr-53-german-gpt2
