Why 12b? Who could run that locally?

#1
by kaidu88 - opened

The model looks good for sure, but why is there only a 12b model? :(
Even with the best consumer hardware you could barely load this model into vram.

Any plans of making a smaller or distilled model, e.g. 2b-4b, that could run on 24gb vram?


why do you people complain about open source models being awful and then expect it to be good at 1B parameters? have you ever used a 1B LLM compared to even a 7B LLM?

quantize to 8 bit -> runs on 12gb -> runs on your 3060

I have a 3090 with 24GB VRAM. But 12b parameters in float16 are still ~24GB, and that does not include the two text encoders or the internal state of the model.
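For reference, the back-of-the-envelope arithmetic behind that number (weights only, ignoring text encoders, VAE, activations and CUDA overhead):

# rough weight-only memory math in decimal GB
params = 12e9
for name, bytes_per_param in {"fp16/bf16": 2, "fp8/int8": 1}.items():
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# prints: fp16/bf16: 24 GB, fp8/int8: 12 GB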

Quantizing does not work for image models the way it does for LLMs - at least not for any image model I have tried so far. Maybe this is the one exception, but I would be surprised. Besides that, LLMs are trained on MUCH larger datasets than image models. I doubt that a 12b image model is really better than a 4b image model - we just don't have enough training data for that. PixArt Alpha is a nice example where a 0.6b model outperforms 2b models with ease.
Besides that, even for LLMs we are moving more and more toward models that fit into consumer hardware. So yes, people prefer 7b LLMs to 400b LLMs for most tasks, as they are more efficient, run on consumer hardware and are good enough for most tasks. I'm pretty sure there is a lot of room for improvement from the current SOTA open-source models like SDXL, Würstchen, PixArt and so on to a model that still fits into the vram of consumer hardware.

you can quantize to 8 bit and lose nothing dawg. they're selling a service here so why would they skimp on the vram to please people who AREN'T gonna pay? take what you're given

lmao image gen people finally know how we feel

I'm happy to try out quantization Sayak. Any idea when flux will be supported in diffusers?

The quantization we use for image models is really primitive compared to LLMs, I think because users weren't as desperate, lol. Naive FP8 rounding/quantization destroys LLMs too.

I'm sure people will cram it in 3090s with more advanced schemes.
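To illustrate what "naive FP8 rounding" means, a toy round-trip in plain PyTorch (needs torch >= 2.1 for the float8 dtypes; the matrix is a stand-in, not a real Flux weight):

import torch

w = torch.randn(4096, 4096)  # stand-in for a weight matrix

# Naive cast: every value is rounded to the nearest float8-representable number,
# with no per-tensor or per-channel scaling -- the "primitive" scheme being discussed.
w_fp8 = w.to(torch.float8_e4m3fn)
w_back = w_fp8.to(torch.float32)

rel_err = (w - w_back).abs().mean() / w.abs().mean()
print(f"mean relative error after fp8 round-trip: {rel_err.item():.3%}")

More advanced schemes (per-block scales, outlier handling, NF4 and friends) are what keep that error from turning into visible quality loss.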

PR is open. Will be merged shortly.

I doubt that a 12b image model is really better than a 4b image model - we just don't have enough training data for that. PixArt Alpha is a nice example where a 0.6b model outperforms 2b models with ease.

PixArt uses T5

The text encoder can be processed independently of the model - that's fine. I don't care so much about the size of the text encoders but about the size of the diffusion transformer. It seems like you can quantize it to 8bit without too much loss. I still don't see a big chance that we can finetune it on consumer hardware; even with LoRAs it will be hard. I like the model - it would still be nice to have a smaller variant, even if it were slightly worse in terms of quality.

Wait, how are you guys having problems running this on a 3090? I'm on one and it runs FINE. I wouldn't want any less than the best so I'm glad it's 12b.

Tried the schnell version. I just did what Comfy said on his Flux example page and it works with 12gb vram without any problems. The only thing that looks scary is my system ram going up to nearly 32/32gb when it loads the model lol. I tried both the default and fp8 settings and honestly don't see a difference in quality. But I think if there is any kind of controlnet or loras it would be too much for 12gb. At least as of now - probably some smart people will come up with something to reduce the vram requirements. They always do :D I will test the dev version as well but I'm too lazy to download another model rn

Has anyone managed to run it on macOS? It looks like it's trying to use around 50GB of RAM because of bf16.

Here is my script for running it in <16gb VRAM.

https://gist.github.com/AmericanPresidentJimmyCarter/873985638e1f3541ba8b00137e7dacd9

thx for your service, Ser
The only change I had to make as of now was to use pip install git+https://github.com/huggingface/diffusers.git@27637a5

I tested it on WSL2 (Win11) with 16GB VRAM

[attached images: test_flux_distilled0.png to test_flux_distilled3.png]

Is there any way to tell the model to load across 2x GPUs? I've got dual 3090s and it wasn't something I was able to ChatGPT easily.


Idk about splitting the whole model across two GPUs, but you could just put the text encoders on one 3090 and the diffusion model on the other; you should be able to run it in fp16 that way.
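A minimal sketch of that split, reusing the encode_prompt pattern from the snippet further down; the repo id, prompt and step count are placeholders, so treat it as a rough outline rather than a tested recipe:

import torch
from diffusers import FluxPipeline

repo = "black-forest-labs/FLUX.1-dev"

# Text encoders (CLIP + T5) on the second 3090
text_pipe = FluxPipeline.from_pretrained(
    repo, transformer=None, vae=None, torch_dtype=torch.bfloat16
).to("cuda:1")

# Diffusion transformer + VAE on the first 3090
denoise_pipe = FluxPipeline.from_pretrained(
    repo, text_encoder=None, text_encoder_2=None, torch_dtype=torch.bfloat16
).to("cuda:0")

with torch.inference_mode():
    # Encode on GPU 1, then hand the embeddings over to GPU 0
    prompt_embeds, pooled_prompt_embeds, _ = text_pipe.encode_prompt(
        prompt="a corgi wearing sunglasses", prompt_2=None, max_sequence_length=512
    )
    image = denoise_pipe(
        prompt_embeds=prompt_embeds.to("cuda:0"),
        pooled_prompt_embeds=pooled_prompt_embeds.to("cuda:0"),
        num_inference_steps=28,
        guidance_scale=3.5,
    ).images[0]
image.save("flux-two-gpus.png")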

llm people: first time?

Working fine on 24GB

from transformers import T5EncoderModel
import time
import gc
import torch
import diffusers


def flush():
    # Free Python garbage and release cached CUDA memory between stages
    gc.collect()
    torch.cuda.empty_cache()


# Stage 1: a pipeline holding only the text encoders (no transformer/VAE)
t5_encoder = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="text_encoder_2", revision="refs/pr/7", torch_dtype=torch.bfloat16
)
text_encoder = diffusers.DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder_2=t5_encoder,
    transformer=None,
    vae=None,
    revision="refs/pr/7",
)

# Stage 2: a pipeline holding the transformer + VAE (no text encoders)
pipeline = diffusers.DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/1",
    text_encoder_2=None,
    text_encoder=None,
)
pipeline.enable_model_cpu_offload()


@torch.inference_mode()
def inference(prompt, num_inference_steps=4, guidance_scale=0.0, width=1024, height=1024):
    # Encode the prompt on the GPU, then move the text encoders back to CPU
    text_encoder.to("cuda")
    start = time.time()
    (
        prompt_embeds,
        pooled_prompt_embeds,
        _,
    ) = text_encoder.encode_prompt(prompt=prompt, prompt_2=None, max_sequence_length=256)
    text_encoder.to("cpu")
    flush()
    print(f"Prompt encoding time: {time.time() - start}")
    # Denoise with the transformer pipeline using the precomputed embeddings
    output = pipeline(
        prompt_embeds=prompt_embeds.bfloat16(),
        pooled_prompt_embeds=pooled_prompt_embeds.bfloat16(),
        width=width,
        height=height,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
    )
    image = output.images[0]
    return image
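Called like this, for example (prompt and filename are arbitrary placeholders):

image = inference("A corgi astronaut, studio lighting")
image.save("flux-schnell.png")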

Fp8 thanks to Kijai: https://huggingface.co/Kijai/flux-fp8

Anyone made proper FP8 for schnell like Kijai?

Kijai dev works perfect

If you switch flux1-dev to fp8_e4m3fn in the weight type it seems to work nicely with lcm or lpndm as a sampler on 4 steps.
ComfyUI_00111_.png

Yes, SwarmUI already runs it at fp8 by default.

But I am making an auto installer for a big tutorial, and people will save 11GB of file size.

By the way, I found the fp8 version, fixed the metadata, and it works :)

thanks so much for all the work!

If you switch flux1-dev to fp8_e4m3fn in the weight type it seems to work nicely with lcm or lpndm as a sampler on 4 steps.

Your workflow (embedded in the image) is using the schnell model not the dev model. The schnell model works w/ 1 to 4 steps out of the box (don't need to use LCM sampler).

If I'm missing something and you have a way to generate proper images w/ the Dev model using only 4 steps please elaborate.

OMG... you are right. This happened because Comfy opened a second window, where I had set up the other workflow with schnell. Dev did not do that. Sorry for the wrong input! With dev it's only a blurry mess.

Is there any way to tell the model to load across 2x GPUs? I've got dual 3090s and it wasn't something I was able to ChatGPT easily.

@freeqaz Loaded across my 2x 3090s using WSL. It seems to be using more of GPU 0

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, device_map='balanced')
# pipe.enable_model_cpu_offload()  # optional: offload to CPU to save VRAM instead of using device_map

prompt = '''A dog holding up a sign with a rainbow in it, reading "OP"'''
image = pipe(
    prompt,
    height=512,
    width=512,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
# image.show()
image.save("flux-dev.png")

Example Image
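If GPU 0 keeps filling up more than GPU 1, a max_memory hint alongside device_map may help even it out (assuming a diffusers version that accepts it; the limits below are arbitrary):

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
    max_memory={0: "20GB", 1: "20GB"},
)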

Is there an fp4 version for Flux dev maybe? I have an 8GB GPU.

https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c

Hey Sayak, how can we run Flux schnell in fp4? Do you have a code example?
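Not speaking for the gist above, but one way to get 4-bit (NF4) weights is through bitsandbytes, assuming a recent diffusers built with that support; the prompt, steps and quant settings here are just placeholders:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, BitsAndBytesConfig

repo = "black-forest-labs/FLUX.1-schnell"

# NF4 is a 4-bit format with per-block scaling, i.e. closer to a usable "fp4" than a naive cast
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    repo, subfolder="transformer", quantization_config=nf4_config, torch_dtype=torch.bfloat16
)
pipe = FluxPipeline.from_pretrained(repo, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep the text encoders off the GPU until needed

image = pipe(
    "a watercolor fox in a forest",
    num_inference_steps=4,
    guidance_scale=0.0,
    max_sequence_length=256,
).images[0]
image.save("flux-schnell-nf4.png")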
