FP8 GGUFs please

#3 opened by upstream99

Today the Hunyuan team released an FP8 checkpoint. Could you please make quants for it?

https://x.com/TXhunyuan/status/1869266784039440520?t=nq_PriIprQHHCecz8ZVzFw&s=19

It looks like that checkpoint is the same as the BF16 one, and quantizing from FP8 to e.g. Q8_0 would result in a worse model than going from BF16 to Q8_0, since the source checkpoint is already at a lower precision in FP8.
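To make that intuition concrete, here's a rough simulation (my own sketch, not anything from the Hunyuan release): cast the same weights through BF16 and through FP8 first, then apply a simplified Q8_0 (per-block absmax scale over 32 values, following the llama.cpp scheme) and compare the error against the original weights. It assumes torch >= 2.1 for the float8_e4m3fn dtype.

```python
# Sketch: why quantizing to Q8_0 from an FP8 source is worse than from BF16.
# Assumes torch >= 2.1 (float8_e4m3fn). The Q8_0 here is simplified but
# follows the llama.cpp scheme: blocks of 32, scale = absmax / 127.
import torch

def q8_0_roundtrip(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    xb = x.reshape(-1, block)
    d = xb.abs().amax(dim=1, keepdim=True) / 127.0     # per-block scale
    q = torch.round(xb / d.clamp(min=1e-12)).clamp(-127, 127)
    return (q * d).reshape(x.shape)                    # dequantized values

w = torch.randn(32 * 1024)                  # stand-in for a weight tensor
w_bf16 = w.to(torch.bfloat16).float()       # BF16 source checkpoint
w_fp8 = w.to(torch.float8_e4m3fn).float()   # FP8 source checkpoint

print("Q8_0 from BF16:", (q8_0_roundtrip(w_bf16) - w).abs().mean().item())
print("Q8_0 from FP8 :", (q8_0_roundtrip(w_fp8) - w).abs().mean().item())
# The FP8 path shows a larger error: its rounding is already baked in.
```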

@city96 According to this Reddit comment, the FP8 model from Hunyuan is different from the BF16 one:
https://www.reddit.com/r/StableDiffusion/comments/1hgtsmi/comment/m2mm1ma/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Please also see this screenshot from the Hunyuan devs about how the FP8 checkpoint was cooked:
https://imgur.com/OekygWS

Based on that comment it's just a different quantization method, I think, so there'd be no easy way to mix and match it with the GGUF quants (since you obviously can't double-quantize something lol). I guess if it were better than Q8_0 it might be worth making a separate generic custom node for what they call w8a8 quantization, but I can't even run Q8_0 because I'm on a 10GB 3080, so I'll leave that part to someone else.
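For anyone curious, a minimal sketch of what w8a8 generally means (the common meaning of the term, not a confirmed recreation of whatever Hunyuan did): both weights and activations are quantized to 8-bit integers, with the activation quantization happening at runtime before the matmul.

```python
# Minimal sketch of generic W8A8 (int8 weights + int8 activations) with
# per-tensor symmetric scales; details of Hunyuan's actual method unknown.
import torch

def quantize_sym(t: torch.Tensor):
    scale = t.abs().amax() / 127.0                      # per-tensor scale
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    x_q, x_scale = quantize_sym(x)                      # activations at runtime
    # Real kernels do an int8 GEMM with int32 accumulation; int64 keeps
    # this toy version exact on CPU.
    acc = x_q.to(torch.int64) @ w_q.to(torch.int64).T
    return acc.float() * (x_scale * w_scale)            # dequantize the output

w = torch.randn(64, 64)
w_q, w_scale = quantize_sym(w)                          # done once, offline
x = torch.randn(8, 64)
print((w8a8_linear(x, w_q, w_scale) - x @ w.T).abs().max())
```

The key difference from GGUF quants is the activation side: GGUF quantizes only the weights, and the activations stay in full precision at runtime.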

Also, some comments on the screenshot:

"layer-by-layer parameter statistics"

Unsure what this part means, unless it's similar to scaled FP8...? GGUF quants are by nature layer-by-layer, since you specify the target precision per tensor, but again, unsure what the "statistics" part means.
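If it is like scaled FP8, here's a guess at what that would look like: FP8 weights plus a float scale derived from the weight statistics, so the FP8 range gets fully used. The names and the per-tensor granularity here are assumptions on my part.

```python
# Guess at "scaled FP8": FP8 weights plus a float scale per tensor, chosen
# from the weight statistics. Granularity (per-tensor vs per-layer) assumed.
import torch

def to_scaled_fp8(w: torch.Tensor):
    scale = w.abs().amax() / 448.0          # 448 = largest finite e4m3fn value
    return (w / scale).to(torch.float8_e4m3fn), scale

def from_scaled_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.float() * scale
```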

"removes linear quantization for some sensitive layers"

GGUF files already have this logic in place, and the layers that made sense to leave at max precision (or high-ish precision for the low-bitrate quants) were handled separately. If you go to the file list and click the little "GG" button with the arrow you can see which weights are at which precision.
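If you'd rather check that programmatically than click around the UI, the gguf-py package from the llama.cpp repo can list each tensor's quant type (the filename below is just a placeholder):

```python
# List each tensor's quantization type in a GGUF file.
# pip install gguf  (the gguf-py package from the llama.cpp repo)
from gguf import GGUFReader

reader = GGUFReader("hunyuan-video-q8_0.gguf")  # placeholder filename
for t in reader.tensors:
    print(t.name, t.tensor_type.name, list(t.shape))
```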

"activation value quantization"

I think this either means quantizing the hidden states (lower VRAM usage) or calibrating based on the inputs to the weights and how they interact, recorded on a calibration set. For GGUF files the calibration approach would be called an imatrix, something I still need to experiment with to improve quantization quality for the lower-bit quants, as I only have a non-functional POC based on the llama.cpp code lol
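For the calibration reading of it, this is roughly what imatrix-style collection looks like (my own toy version, not the llama.cpp implementation; the model and data are stand-ins, not HunyuanVideo internals):

```python
# Toy version of imatrix-style calibration: record per-channel activation
# statistics on a calibration set, then use them to weight the quantization
# error so "important" input channels get preserved better.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
stats: dict[str, torch.Tensor] = {}

def make_hook(name: str):
    def hook(mod, inputs, output):
        x = inputs[0].detach().reshape(-1, inputs[0].shape[-1])
        sq = (x * x).sum(dim=0)                          # per-channel energy
        stats[name] = stats.get(name, torch.zeros_like(sq)) + sq
    return hook

for name, mod in model.named_modules():
    if isinstance(mod, nn.Linear):
        mod.register_forward_hook(make_hook(name))

with torch.no_grad():                                    # "calibration set"
    for _ in range(16):
        model(torch.randn(8, 64))

# A quantizer would then minimize (stats[name] * (w - dequant(q)) ** 2).sum()
# per block instead of a plain unweighted MSE when picking scales.
print({name: v.shape for name, v in stats.items()})
```

llama.cpp's actual imatrix tooling does this with real text data and then feeds the statistics into the quantizer; the principle is the same.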
