
Whisper finetuned on Swedish Speech

Whisper is a state-of-the-art automatic speech recognition (ASR) model created by OpenAI that can both transcribe and translate speech in many languages. In this project the "small" Whisper variant, with 244M parameters, was used. The model was fine-tuned on the Swedish subset of the Mozilla Foundation Common Voice 11 dataset.

Each audio clip in the dataset is truncated or padded to a 30-second snippet and then converted to a log-Mel spectrogram, which is the input format the Whisper architecture expects. Training was done on Google Colab, with checkpoints saved to Google Drive during training in case of disconnections. The models were also pushed to a Hugging Face model repository along with TensorBoard data for visualizing the metrics.
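To make the pad/truncate and log-Mel steps concrete, here is a minimal NumPy sketch. It is an illustration, not Whisper's actual feature extractor (the project would use `WhisperFeatureExtractor` from transformers); the constants match Whisper's published settings (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins):

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30     # every clip is padded/truncated to 30 s
N_FFT, HOP = 400, 160  # 25 ms window, 10 ms hop
N_MELS = 80            # whisper-small uses 80 mel bins


def pad_or_truncate(audio: np.ndarray) -> np.ndarray:
    """Force the waveform to exactly 30 s, as described above."""
    target = SAMPLE_RATE * CHUNK_SECONDS
    if len(audio) >= target:
        return audio[:target]
    return np.pad(audio, (0, target - len(audio)))


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Simplified triangular mel filters (illustrative, not Whisper's exact bank)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb


def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Pad/truncate to 30 s, then windowed FFT -> mel projection -> log."""
    audio = pad_or_truncate(audio)
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack(
        [audio[i * HOP : i * HOP + N_FFT] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, n_fft//2 + 1)
    mel = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE) @ power.T
    return np.log10(np.maximum(mel, 1e-10))                 # (n_mels, frames)


# A 5 s clip still yields a fixed-size spectrogram thanks to the 30 s padding.
spec = log_mel_spectrogram(np.random.randn(5 * SAMPLE_RATE))
```

Because every clip is forced to the same 30-second length first, all spectrograms share one shape, which is what lets Whisper batch variable-length audio.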

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| `num_train_epochs` | 1 |
| `per_device_train_batch_size` | 16 |
| `gradient_accumulation_steps` | 1 |
| `learning_rate` | 1e-4 |
| `warmup_steps` | 50 |
| `max_steps` | 1000 |
| `gradient_checkpointing` | True |
| `fp16` | True |
| `per_device_eval_batch_size` | 8 |
| `generation_max_length` | 225 |
| `save_steps` | 250 |
| `eval_steps` | 250 |
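These names map directly onto transformers' `Seq2SeqTrainingArguments`. A hedged sketch is below; the `output_dir`, `evaluation_strategy`, `predict_with_generate`, and `push_to_hub` entries are assumptions added to make the fragment coherent, and note that `max_steps=1000` takes precedence over `num_train_epochs` when both are set:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-sv",   # assumed name, not stated in the card
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    warmup_steps=50,
    max_steps=1000,                  # overrides num_train_epochs when set
    gradient_checkpointing=True,
    fp16=True,
    per_device_eval_batch_size=8,
    predict_with_generate=True,      # needed for generation_max_length to apply
    generation_max_length=225,
    evaluation_strategy="steps",     # needed for eval_steps to apply
    save_steps=250,
    eval_steps=250,
    push_to_hub=True,                # mirrors the card's Hub checkpointing
)
```

This arguments object would then be passed to a `Seq2SeqTrainer` together with the model, processed dataset, and a WER metric.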
