Whisper-Large-V3 does not work when the dtype given in config.json is used explicitly

#42
by ait-paca - opened

Hi,
Thanks for sharing this model and related work.

I downloaded the Whisper-Large-V3 model with a huggingface_hub snapshot download, ignoring the patterns for msgpack, h5, fp32*, and safetensors files. As a result, only pytorch_model.bin (with dtype float16, ~3 GB) is downloaded as the model file, along with all the remaining repo files.
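For reference, a minimal sketch of the download described above (the exact ignore patterns are an assumption based on the description):

```python
from huggingface_hub import snapshot_download

# Download the repo, skipping everything except the float16 pytorch_model.bin
# and the remaining config/tokenizer files (patterns assumed from the description).
snapshot_download(
    repo_id="openai/whisper-large-v3",
    ignore_patterns=["*.msgpack", "*.h5", "*fp32*", "*.safetensors"],
)
```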

In config.json, the torch_dtype is given as float16. With the default settings (without mentioning torch_dtype at all, or with it explicitly set to torch_dtype=torch.float32), both the pipeline route and the AutoConfig... + AutoModel... + AutoProcessor... route work fine. However, if I try torch_dtype=torch.float16, I get the following error:

Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor...

My setup:
OS: macOS Monterey 12.7.1
device = "cpu"
Python: 3.11.6
Transformers: 4.35.2

2 questions:

  1. Why is this model not working with the dtype given in config.json? If I explicitly define the dtype for the Large-V3 model, what other changes should I make for it to work?

  2. I also tried all the other Whisper models (tiny, base, small, and medium, all of which have dtype float32 in config.json) and ran into the same problem with torch_dtype=torch.float16. Is it even possible to use (any) Whisper PyTorch model with a dtype different from the entry in config.json, or am I making some basic mistake? If it is possible, what adaptations do I need to make?

Thank you in advance.

The Whisper model was trained in float16, hence the weights are in float16 on the Hub. When we call from_pretrained, we automatically upcast to float32, unless you specify torch_dtype=torch.float16 as you have done: https://huggingface.co/docs/transformers/main_classes/model#model-instantiation-dtype
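A quick way to see this behaviour (a minimal sketch, not from the original thread):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

# By default the float16 checkpoint is upcast to float32 on load:
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")
print(model.dtype)  # torch.float32

# Passing torch_dtype keeps the weights in half precision:
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch.float16
)
print(model.dtype)  # torch.float16
```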

To fix your issue:

  1. You also need to convert your inputs to the same dtype as the model, i.e. input_features = input_features.to(torch.float16) - see the sketch after this list. If you can share a code snippet of how you're running the model, I can show you where to add this line.
  2. Yes - following from 1, you can run any Whisper model in your desired dtype, provided the input_features are the same dtype as the model.
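A minimal sketch of what this looks like end-to-end (the audio sample is an assumption for illustration; any 16 kHz waveform works):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained(model_id)

# A dummy LibriSpeech sample stands in for your audio here.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
# The key line: cast the inputs to the same dtype as the model weights.
input_features = input_features.to(torch.float16)

predicted_ids = model.generate(input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```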

You can see an example for float16 evaluation for distil-whisper here: https://huggingface.co/distil-whisper/distil-large-v2#evaluation

Thank you @sanchit-gandhi for your response.

I'm sharing my little Colab notebook:
https://colab.research.google.com/drive/1uNCpZd6_g2MeuRn20AOW8cXs7Dbinoff

I'm using a simple pipeline call. The dtype and device are defined in the 5th cell. For a given media file (an audio file; I've tested mp4/webm/mp3 formats), the code runs perfectly fine with u_torch_dtype = torch.float32. However, changing it to float16 raises the issue I mentioned.
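For context, a minimal reconstruction of the pipeline call being described (assumed from the description above, not copied from the notebook; the media filename is a placeholder):

```python
import torch
from transformers import pipeline

u_torch_dtype = torch.float16  # works with torch.float32, fails with torch.float16
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=u_torch_dtype,
    device="cpu",
)
print(pipe("audio.mp3")["text"])  # placeholder media file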

Please let me know what changes are required, and where, pertaining to input_features as you suggested above.
