What are the memory requirements for running the model?

#6
by joanfihu - opened

Hello there, good work with this model. What are the system requirements for this model? I don't seem to be able to fit it in a 24GB GPU card

Hi @joanfihu , it works in a 24GB GPU using bfloat16 :)
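For a rough sense of why precision matters here, the weight footprint can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch, not a measurement: the parameter count below (about 9.4 billion) is an assumption that is not stated in this thread, and the estimate ignores activations, caches, and framework overhead.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, caches, and framework overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# Assumption (not from this thread): the model has roughly 9.4 billion parameters.
ASSUMED_PARAM_COUNT = 9_400_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        size = weight_memory_gib(ASSUMED_PARAM_COUNT, dtype)
        print(f"{dtype}: {size:.1f} GiB")
```

Under that assumption, full-precision weights alone would not fit in 24GB, while 16-bit weights take half the space and leave some headroom.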

@pcuenq Can you tell me how you modified the code to use bfloat16?

For anyone else wondering, just update the following line:

model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)

Is there a way to get this to work on CPU with under 25 GB of RAM? When I tried to set it to float16, I got the following error:

"RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'"

It seems like float16 is not supported for some operations on CPU, so low precision CPU mode might need some actual tweaks to the code.
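One thing that might be worth trying, as an untested suggestion rather than a confirmed fix: the error names 'Half' (float16), and bfloat16 matrix multiplication is generally available on CPU in PyTorch, though it can be slow. The helper below is a hypothetical sketch (the name `pick_dtype_name` is made up for illustration) that picks a dtype name per device so float16 is never requested on CPU.

```python
def pick_dtype_name(device: str) -> str:
    """Pick a 16-bit torch dtype name that avoids float16 ('Half') on CPU.

    Hypothetical helper for illustration; returns a name rather than a
    torch.dtype so it can be read without PyTorch installed.
    """
    if device.startswith("cuda"):
        return "float16"
    # CPU (and anything else): fall back to bfloat16 instead of float16.
    return "bfloat16"


# Possible usage, assuming torch and transformers are installed (untested here):
#   import torch
#   dtype = getattr(torch, pick_dtype_name("cpu"))  # torch.bfloat16
#   model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cpu", torch_dtype=dtype)
```

Whether this fits under 25 GB of RAM in practice depends on the inputs and on overhead beyond the weights.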

The reason I think this would be nice, if possible, is that many people don't have more than 32 GB of RAM or a graphics card with 20+ GB of VRAM.

model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)

Maybe debatable, but I have read that bfloat16 is recommended over float16 because of its structure: it keeps the same 8-bit exponent as float32, so it covers a much wider range than float16 at the cost of some precision. So: torch_dtype=torch.bfloat16
I tested both. The model loads faster in bf16 (6s vs 10s for f16), but inference is faster with f16 (1.22 vs 1.70 for bf16). I can't assess quality, but people say bf16 is better.

@latent-variable
Thanks, it worked!

I have found that when using multiple GPUs, this model requires more than 32GB of VRAM. I have 3x A4000 16GB cards (a weird setup, I know), and only just manage to fit within their limit when using a custom device_map. I hit a few issues related to which layers have to be on the same GPU, so in case it helps others, the following layers need to be assigned to the same device in the device_map:

device_map = {
    'language_model.model.embed_tokens.weight': 'cuda:0',
    'language_model.lm_head.weight': 'cuda:0',
    'language_model.model.final_layernorm.bias': 'cuda:0',
    'language_model.model.final_layernorm.weight': 'cuda:0',
    'vision_embed_tokens.bias': 'cuda:0',
    'vision_embed_tokens.weight': 'cuda:0',
    # rest of the layers shared among devices..
}
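To illustrate how the rest of such a map might be filled in, here is a minimal sketch that pins the entries listed above to one device and spreads the decoder layers over the available GPUs in contiguous chunks. The layer prefix `language_model.model.layers` and the layer count used in the example are assumptions, not taken from the post, so check them against the actual model before relying on this.

```python
from typing import Dict, List

# Entries the post above says must live on the same device.
PINNED = [
    "language_model.model.embed_tokens.weight",
    "language_model.lm_head.weight",
    "language_model.model.final_layernorm.bias",
    "language_model.model.final_layernorm.weight",
    "vision_embed_tokens.bias",
    "vision_embed_tokens.weight",
]


def build_device_map(
    num_layers: int,
    devices: List[str],
    pinned_device: str = "cuda:0",
    # Assumption: decoder layers are named "<prefix>.<index>".
    layer_prefix: str = "language_model.model.layers",
) -> Dict[str, str]:
    """Pin the shared entries to one device and split layers into contiguous chunks."""
    device_map = {name: pinned_device for name in PINNED}
    per_device = -(-num_layers // len(devices))  # ceiling division
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = devices[i // per_device]
    return device_map


if __name__ == "__main__":
    # Hypothetical example: 36 layers over three GPUs.
    example = build_device_map(36, ["cuda:0", "cuda:1", "cuda:2"])
    print(example["language_model.model.layers.35"])
```

An even split is the simplest choice, but the pinned device also holds the embeddings and output head, so giving it fewer layers than the others may balance memory better.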

With torch_dtype=torch.bfloat16, my usage was more like 40GB, though I might be doing something else wrong. E.g. my code is:

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = FuyuForCausalLM.from_pretrained(model_id, device_map=device_map) # torch_dtype=torch.bfloat16 should be here

Any insights would be appreciated. Above I used the torch_dtype=torch.bfloat16 on the wrong line 🤦‍♂️, never mind me.

Thanks for releasing such an interesting model! 👍

Thanks for sharing your experiences!

I have run it on two RTX 3060s, each with 12GB. It needs a custom device map and FP16. It can only process images up to ~500x1000 resolution, though.
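If larger images need to go through a setup like this, one possible workaround is to downscale them first. The sketch below only computes a target size that fits inside a bounding box while preserving aspect ratio. The helper name `fit_within` is hypothetical, and which of 500 and 1000 is the width is not stated above, so both limits are left as parameters.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_width: int, max_height: int) -> Tuple[int, int]:
    """Return a (width, height) that fits inside the box, keeping aspect ratio.

    Images that already fit are returned unchanged (never upscaled).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Possible usage with Pillow, assuming it is installed (untested here):
#   from PIL import Image
#   image = Image.open("input.png")
#   image = image.resize(fit_within(*image.size, 500, 1000))
```

The actual limit will depend on available VRAM, so the bounds are best found by experiment on the target hardware.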
