can you provide generation examples? is the quantized version coherent?

Opened by MayensGuds

I will provide some comparisons with the official demo later. I am running a 4-bit quantized LLM, so that will have a bigger impact than the bf16/fp16 t2i model. The official demo also uses bf16 by default, but it downcasts from FP32 to bf16 while loading the model, so you download about 10GB of extra data but end up with the same model in VRAM.
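To make the downcasting point concrete, here is a toy PyTorch sketch (not this repo's actual loading code; the tensor shape is made up) showing that casting FP32 weights to bf16 at load time leaves the same bytes in memory as downloading a bf16 checkpoint directly:

```python
import torch

# Made-up weight, just to show the size difference: download FP32, cast to
# bf16 on load, and what sits in memory is half the size of the download.
fp32_weight = torch.randn(4096, 4096, dtype=torch.float32)  # 4 bytes/value
bf16_weight = fp32_weight.to(torch.bfloat16)                # 2 bytes/value

print(fp32_weight.element_size() * fp32_weight.nelement() / 1024**2)  # ~64 MiB
print(bf16_weight.element_size() * bf16_weight.nelement() / 1024**2)  # ~32 MiB
```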

I compared the quantized 2B with the official 2B demo. It's close in quality, but not pixel-for-pixel identical.

@MayensGuds Here are generations from my local BF16 .pth (I haven't converted it to safetensors yet) Lumina Next 2B with bnb 4-bit Gemma 2B, versus their online demo, which I think runs the FP32 file downcast to bf16 with a probably fp16 Gemma 2B. So I think most of the difference comes from the strongly quantized Gemma 2B rather than the t2i model.
https://pixeldrain.com/l/FP6BzFH7
They pulled their Lumina 5B demo offline, so I can't compare that as easily. This Lumina Next T2I bf16 plus Gemma 2B 4-bit setup takes up around 13GB of VRAM. I suggest using Lumina Next 2B over Lumina 5B; it just creates more pleasant generations.
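For reference, the 4-bit text encoder side of a setup like this can be loaded with the standard transformers + bitsandbytes path. Rough sketch below; the model id `google/gemma-2b` and the NF4/compute-dtype settings are my assumptions, not necessarily what this repo ships:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed 4-bit (NF4) quantization config for the Gemma 2B text encoder.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
text_encoder = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Quick check of how much VRAM the quantized encoder takes on its own.
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB allocated")
```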
