Why float16? Why not float32?

#4
opened by ctranslate2-4you

So why was the decision made to only release the float16 version, or is that the way it was trained from the get-go? All other Whisper models are in float32, everything up to large-v2. Any admins care to enlighten the rest of us? Thanks.

OpenAI only publishes fp16 weights, so we know the weights work as intended in half precision. To improve download speed for users, the main transformers weights are also fp16 (half the size of the fp32 weights => half the download time). For those who do want fp32, I've also pushed fp32 weights to this repo: #5. You can load them by passing variant="fp32" when you call from_pretrained:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3", variant="fp32")

With attn_implementation="flash_attention_2", I don't see any performance boost when using the fp32 variant vs. not:

# Presumed setup (not shown in my original snippet): model id and target dtype.
import torch
from transformers import WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3"
on_gpu_float_size = 16  # toggled between 16 and 32 for the comparison
torch_dtype = torch.float32 if on_gpu_float_size == 32 else torch.float16

model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    variant="fp32" if on_gpu_float_size == 32 else None,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; fp16/bf16 only
)
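
For what it's worth, here's a rough sketch of how one could time the two variants head-to-head. Assumptions not in the original snippet: a CUDA GPU, flash-attn installed, and a dummy 30-second input of 128 mel bins x 3000 frames (the shape large-v3's feature extractor produces):

import time
import torch
from transformers import WhisperForConditionalGeneration

def time_variant(variant):
    # Load the chosen checkpoint variant, cast to fp16 (Flash Attention 2 needs fp16/bf16).
    model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-large-v3",
        variant=variant,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        attn_implementation="flash_attention_2",
    ).to("cuda")
    # Dummy 30-second input: large-v3 expects 128 mel bins x 3000 frames.
    feats = torch.zeros(1, 128, 3000, dtype=torch.float16, device="cuda")
    model.generate(input_features=feats, max_new_tokens=32)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_features=feats, max_new_tokens=32)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("default (fp16) variant:", time_variant(None))
print("fp32 variant:          ", time_variant("fp32"))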

This is as expected: there shouldn't be any performance boost from using the fp32 weights, @isaac-telebroad. Since the model was only trained in fp16, upcasting the weights from fp16 -> fp32 for inference won't give any performance gain (no free lunches!).
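
Another way to see it: casting fp16 -> fp32 is exact, so the upcast copy can't contain any information the fp16 checkpoint didn't already have. A quick illustration in plain PyTorch:

import torch

# Every fp16 value is exactly representable in fp32, so upcasting is lossless
# but adds nothing new; the round trip back to fp16 recovers the identical tensor.
x16 = torch.randn(1024).to(torch.float16)  # stand-in for a checkpoint tensor
x32 = x16.to(torch.float32)
assert torch.equal(x32.to(torch.float16), x16)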
