Quantization scripts

#1
by WaveCut - opened

Could you share the scripts you used to quantize both the transformer and text_encoder_2? I want to reproduce it using a different merged Flux checkpoint.

Thanks in advance!

This is the fastest code I have tried so far: 30 seconds to generate a 1024x1024 image on an RTX 3080. That's faster than SDXL and many times better quality. Pretty amazing, really. I think @HighCWu has something here. It could probably use some way to add LoRAs, and GGUF support would be really awesome.

You can load any transformer with this; just re-use @HighCWu's text_encoder_2:

import torch
from diffusers import FluxPipeline

# Load the base pipeline without the two components we will swap in below.
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=None,
    text_encoder_2=None,
    torch_dtype=torch.bfloat16,
)

from model import T5EncoderModel  # better to run the non-quantized version of this if you can
text_encoder_2: T5EncoderModel = T5EncoderModel.from_pretrained(
    "HighCWu/FLUX.1-dev-4bit",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)
flux.text_encoder_2 = text_encoder_2

model_id = "your other flux model"  # <---- any Flux model in diffusers format
from model import FluxTransformer2DModel  # HighCWu's transformer class
transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
flux.transformer = transformer
flux.enable_model_cpu_offload()
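
A quick usage sketch with the pipeline assembled above (the prompt, step count, and seed are just placeholder values):

# Minimal generation example using the 4-bit pipeline set up above.
prompt = "a photo of a red fox in the snow"  # placeholder prompt
image = flux(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,  # a common default for FLUX.1-dev
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux_4bit_test.png")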

OK, I got it working, thanks for the hints.

  1. You need to convert the fp8 model to diffusers format with https://github.com/huggingface/diffusers/blob/main/scripts/convert_flux_to_diffusers.py. This may require adding "model.diffusion_model" to each key before mapping. Make sure to save it in bf16; I tried the fp8 formats and they are not compatible.
  2. You need to load it with this codebase, passing quantization_config (BitsAndBytesConfig) to FluxTransformer2DModel.from_pretrained (see the sketch below).
  3. Save the model with .save_pretrained.

I went the wrong way at first and tried quantizing with the official bitsandbytes branch (https://github.com/huggingface/diffusers/pull/9213); it is bugged and produces wrong layer shapes after saving.
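
A minimal sketch of steps 2 and 3, assuming this repo's model.py accepts a transformers-style BitsAndBytesConfig; the paths are placeholders:

import torch
from transformers import BitsAndBytesConfig
from model import FluxTransformer2DModel  # HighCWu's transformer class

# 4-bit NF4 settings; assumed to match what this codebase expects.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the converted (bf16, diffusers-format) checkpoint in 4-bit...
transformer = FluxTransformer2DModel.from_pretrained(
    "path/to/converted-diffusers-checkpoint",  # output of convert_flux_to_diffusers.py
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)

# ...then save the quantized weights for reuse (step 3).
transformer.save_pretrained("path/to/quantized-transformer")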

For Civitai models, you can do this after you download the file from Civitai:

f = FluxPipeline.from_single_file(
    filepath_to_local_file,
    scheduler=None,
    tokenizer=None,
    tokenizer_2=None,
    # transformer is deliberately not set to None: it is the only component we want to load
    text_encoder=None,
    vae=None,
    text_encoder_2=None,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
)

f.save_pretrained("yournewfluxfolder/"+your_model_name)

This will save only the transformer, making a transformer/ subfolder. Then load it by itself using the transformer subfolder just like you do normally.

transformer: FluxTransformer2DModel = FluxTransformer2DModel.from_pretrained(
    "yournewfluxfolder/" + your_model_name,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)

flux.transformer = transformer

I found the results are way better if you can run the non-quantized t5_xxl model (text_encoder_2). I was able to do this with my second 10 GB GPU. That said, running only @HighCWu's 4-bit version still looks just as good as the full version in almost all cases, and it only takes about 20 seconds to generate a 1024x1024 image on my RTX 3080. Even the full version of t5_xxl is severely limited, so I don't expect much from it (especially for NSFW). Until someone trains a better T5 for this we are stuck with it. I hear it's really hard to train certain content, and I am pretty sure that's because of the T5.
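
A hedged sketch of what running the full-precision text_encoder_2 on a second GPU can look like (the device index is an assumption for a two-GPU setup, and how well this interacts with enable_model_cpu_offload may vary):

import torch
from transformers import T5EncoderModel

# Full-precision T5-XXL encoder from the base FLUX repo, kept on the second GPU.
text_encoder_2_full = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
).to("cuda:1")  # assumes a second GPU; adjust the device index for your setup

flux.text_encoder_2 = text_encoder_2_full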

Another thing I was able to do is img2img, using this as a refiner for SDXL models. The latents won't convert, but the PIL image will. You can run SDXL for n steps, then do 4-8 steps with Flux for the final image. I had to use the small HighCWu version of T5 for this, though, because img2img takes up more memory than my 3080 can handle. Thing is, you don't really need it as much anyway, since you're mostly going off the pregenerated image.

from diffusers import FluxImg2ImgPipeline

flux_img2img = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder_2=text_encoder_2_small,  # use HighCWu's text_encoder_2 for this
    transformer=transformer,  # whatever 4-bit transformer you are using
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
)
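
A hedged usage sketch of the SDXL-then-Flux refiner idea (sdxl_pipe, prompt, and the strength/step values are assumptions to illustrate the flow):

# Refine an SDXL output with a few Flux img2img steps.
flux_img2img.enable_model_cpu_offload()

base_image = sdxl_pipe(prompt).images[0]  # PIL image from your SDXL pipeline
refined = flux_img2img(
    prompt=prompt,
    image=base_image,
    strength=0.3,            # low strength keeps the SDXL composition
    num_inference_steps=20,  # with strength=0.3 this runs ~6 actual denoising steps
    guidance_scale=0.0,      # schnell is usually run without guidance
).images[0]
refined.save("sdxl_plus_flux.png")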
