Speculative Decoding doesn't work yet with Whisper-v3

#23

In the example above we advertise using tiny in a ForCausalLM class, which can't work since tiny does not share the same encoder as large-v3. We can advertise it as soon as distil-v3 is out.

cc @sanchit-gandhi to double check

Good with me! Note that in an ideal world, having the same encoder dimensions should not be a prerequisite for speculative decoding. We should be able to run speculative decoding with two models even if they have different encoder dims (we just need to run one forward pass through each encoder). The only constraint should be having the same vocab size, so that the logit space is shared. This is the condition that large-v3 violates, since it uses a larger vocab size than previous Whisper models.

cc @reach-vb as well

Good catch with the vocabulary! RE: encoder dims, yes, it's definitely possible, but it would also mean running the audio through two different Whisper processors (to get the correct mel dimensions), which is not supported at the moment, and I'm not sure it will be in the near future.

=> So I think it'll be best to wait for distil-whisper-v3 here for spec decoding :-)
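
For reference, the two conditions discussed in this thread can be sketched as a small compatibility check. This is only an illustration, not code from the library: `ConfigSketch` and `blocking_issues` are hypothetical names, the field names are assumed to mirror the Whisper config attributes, and the numeric values are assumptions that should be verified against the published model configs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for a Whisper config; not a transformers class."""
    name: str
    vocab_size: int    # size of the decoder's logit space
    num_mel_bins: int  # mel feature dimension the encoder expects
    d_model: int       # hidden size; allowed to differ between the two models


def blocking_issues(main: ConfigSketch, assistant: ConfigSketch) -> List[str]:
    """Return the config fields that block pairing the two models."""
    issues = []
    # A shared logit space is required to compare draft and target tokens.
    if main.vocab_size != assistant.vocab_size:
        issues.append("vocab_size")
    # Different mel dims would need two processors, which the thread says
    # is not supported at the moment (a current limitation, not a hard rule).
    if main.num_mel_bins != assistant.num_mel_bins:
        issues.append("num_mel_bins")
    # d_model is deliberately not checked: each encoder runs its own pass.
    return issues


# Assumed values for illustration; check them against the real configs.
LARGE_V3 = ConfigSketch("large-v3", vocab_size=51866, num_mel_bins=128, d_model=1280)
LARGE_V2 = ConfigSketch("large-v2", vocab_size=51865, num_mel_bins=80, d_model=1280)
TINY = ConfigSketch("tiny", vocab_size=51865, num_mel_bins=80, d_model=384)

print(blocking_issues(LARGE_V3, TINY))
print(blocking_issues(LARGE_V2, TINY))
```

Under these assumed values, the large-v3/tiny pair is flagged on both fields, while large-v2/tiny passes even though the hidden sizes differ.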

patrickvonplaten changed pull request status to merged
