MPS backend
Can it be fine-tuned using the MPS backend?
Yes, you can. On macOS you can change the following code
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
to this one:
device = "mps" if torch.backends.mps.is_available() else "cpu"
torch_dtype = torch.float16 if torch.backends.mps.is_available() else torch.float32
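If you want a single snippet that covers CUDA, Apple Silicon, and CPU, a minimal sketch looks like this (the fallback order CUDA -> MPS -> CPU is my assumption; adapt it to your setup):

import torch

# Prefer CUDA, then Apple's MPS backend, then fall back to CPU
if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
elif torch.backends.mps.is_available():
    device = "mps"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32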
You should then be able to run fine-tuning as per this blog post with the HF Trainer - simply swap the model checkpoint openai/whisper-small for distil-whisper/distil-large-v2.
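As a rough sketch of that swap (only the checkpoint name changes; the rest of the fine-tuning script from the blog post stays the same, and device is the variable selected above):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "distil-whisper/distil-large-v2"  # instead of "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
model.to(device)  # "mps" on Apple Silicon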
When swapping the model checkpoint openai/whisper-small for distil-whisper/distil-large-v2, fine-tuning is much slower. Does this have to do with the PyTorch MPS backend, i.e. some operators used by Distil-Whisper not being implemented?
All the operators are the same, since there is no code change going from openai/whisper-small to distil-whisper/distil-large-v2. Both use the same modelling file: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py
Whisper small has 242M parameters, whereas Distil-Whisper large-v2 has 756M. This is because the encoder of Distil-Whisper large-v2 is much wider and deeper (32 layers) than that of Whisper small (12 layers). That means the forward and backward passes through the encoder take much longer, which is likely the reason for the slower training.
In short: Distil-Whisper large-v2 is a larger model than Whisper small (even though it is faster at inference), so training takes longer.
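If you want to verify this yourself, you can compare the two configs and parameter counts directly (a quick sketch; it downloads both checkpoints):

from transformers import WhisperConfig, WhisperForConditionalGeneration

for checkpoint in ["openai/whisper-small", "distil-whisper/distil-large-v2"]:
    config = WhisperConfig.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{checkpoint}: {n_params / 1e6:.0f}M parameters, "
        f"{config.encoder_layers} encoder layers, "
        f"{config.decoder_layers} decoder layers, d_model={config.d_model}"
    )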