MPS backend

#13 opened by kadriu

Can it be fine-tuned using the MPS backend?

Yes, you can. On macOS, you can change the following code

device = "cuda:0" if torch.cuda.is_available() else "cpu"

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

to this one:

device = "mps" if torch.backends.mps.is_available() else "cpu"

torch_dtype = torch.float16 if torch.backends.mps.is_available() else torch.float32
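Putting that together, a minimal sketch of the device/dtype selection that falls back gracefully (CUDA, then MPS, then CPU); the float16-on-MPS choice mirrors the lines above:

import torch

# Pick the best available backend: CUDA, then Apple's MPS, then CPU.
if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
elif torch.backends.mps.is_available():
    device = "mps"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32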

Whisper Distillation org

You should then be able to run fine-tuning as per this blog post with the HF Trainer: simply swap out the model checkpoint openai/whisper-small for distil-whisper/distil-large-v2.
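For reference, a minimal sketch of loading the swapped checkpoint before setting up the Trainer, assuming the standard Transformers Whisper classes used in the fine-tuning blog post:

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = "mps" if torch.backends.mps.is_available() else "cpu"
torch_dtype = torch.float16 if torch.backends.mps.is_available() else torch.float32

# Same loading code as for openai/whisper-small, just with the Distil-Whisper checkpoint.
processor = WhisperProcessor.from_pretrained("distil-whisper/distil-large-v2")
model = WhisperForConditionalGeneration.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=torch_dtype
)
model.to(device)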

When swapping out the model checkpoint openai/whisper-small for distil-whisper/distil-large-v2, the fine-tuning process is much slower than with openai/whisper-small. Does this have to do with the PyTorch MPS backend, e.g. some operators (used by Distil-Whisper) not being implemented?

Whisper Distillation org

All the operators are the same, since there is no code change going from openai/whisper-small -> distil-whisper/distil-large-v2. Both use this modelling file: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py

Whisper small has 242M parameters, whereas Distil-Whisper large-v2 has 756M. This is because the encoder of Distil-Whisper large-v2 is much wider and deeper (32 layers) than that of Whisper small (12 layers). That means the forward and backward propagation through the encoder takes much longer, which is likely the reason for slower training.

In short: Distil-Whisper large-v2 is a larger model than Whisper small (even though it's faster), so training takes longer.
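If you want to verify this yourself, a quick sketch that compares the two configs and parameter counts (note that it downloads and loads both checkpoints):

from transformers import WhisperConfig, WhisperForConditionalGeneration

for ckpt in ["openai/whisper-small", "distil-whisper/distil-large-v2"]:
    config = WhisperConfig.from_pretrained(ckpt)
    model = WhisperForConditionalGeneration.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{ckpt}: {config.encoder_layers} encoder layers, "
        f"{config.decoder_layers} decoder layers, d_model={config.d_model}, "
        f"{n_params / 1e6:.0f}M parameters"
    )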
