Please post f16 quantization.

by ZeroWw - opened

Please post f16 quantization.
Requantizing from f16 or f32 gives better results.
If you can, post them both.

I thought the original format was BF16.

Yes, but f16 (fp16) doesn't harm the model. bf16 is way bigger.

Qwen org

BF16 and F16 should be identical in size

If you need the f32, I uploaded it here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-GGUF/blob/main/Qwen2-7B-Instruct-f32.gguf

Hmm, maybe I got confused... I thought bf16 was way bigger than f16 (I know they are both 16-bit); perhaps I was tired and read it wrong.
Anyway, I have now posted my quantizations of Qwen1.5 and Qwen2...

Qwen org

BF16 represents a larger range of values, but it is not bigger.
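To make that concrete, here is a minimal sketch using PyTorch's finfo (assuming torch is installed): both types take 16 bits per value, but bf16 covers roughly the float32 exponent range.

```python
import torch

# Both formats use 16 bits (2 bytes) per value, so file size is the same.
print(torch.finfo(torch.float16).bits)   # 16
print(torch.finfo(torch.bfloat16).bits)  # 16

# bf16 trades mantissa precision for a float32-sized exponent range.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
```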

Got it, thanks.

On second thought, I checked and I don't agree: if I convert to bf16 with llama.cpp, I get a much bigger file than if I convert to f16.
Perhaps it's because llama.cpp does a mixed conversion and keeps some tensors at f32...
Anyway, I see no degradation at pure f16.
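If you want to check that yourself, one way is to list each tensor's stored type with the gguf Python package; a rough sketch (the reader API usage and the file name here are assumptions on my side):

```python
from gguf import GGUFReader  # pip install gguf

# Hypothetical path; point it at your own converted file.
reader = GGUFReader("Qwen2-7B-Instruct-bf16.gguf")

# Print each tensor's name and stored type (e.g. BF16, F16, F32).
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type.name)
```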

Qwen org

That's llama.cpp doing it then; if you take a bf16 model and convert it to fp16, the model size stays identical.
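For example, casting a bf16 tensor to fp16 in PyTorch keeps the byte count identical, since both use 2 bytes per value; a small sketch (the tensor shape is arbitrary):

```python
import torch

# Same weights stored as bf16 and as fp16: identical number of bytes.
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_f16 = w_bf16.to(torch.float16)

print(w_bf16.nelement() * w_bf16.element_size())  # 33554432 bytes
print(w_f16.nelement() * w_f16.element_size())    # 33554432 bytes
```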
