What are the memory requirements for running the model?

by joanfihu - opened Oct 18, 2023

Oct 18, 2023

Hello there, good work with this model. What are the system requirements for running it? I don't seem to be able to fit it on a 24GB GPU.

pcuenq

Oct 19, 2023

Hi @joanfihu , it works in a 24GB GPU using bfloat16 :)
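As a rough sanity check on the 24GB figure, the memory taken by the weights can be estimated from the parameter count and the bytes per parameter. The sketch below assumes about 8 billion parameters (a guess based only on the "8b" in the model name; the real count may differ) and ignores activations, caches, and framework overhead, so actual usage will be higher.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# ASSUMPTION: ~8e9 parameters, guessed from the model name; not an official figure.
ASSUMED_PARAMS = 8e9


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under that assumption, float32 weights alone come to roughly 30 GiB, which would not fit in 24GB, while either 16-bit type needs about half of that.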

latent-variable

Oct 19, 2023

@pcuenq Can you tell me how you modified the code to use bfloat16?

latent-variable

Oct 19, 2023

For anyone else wondering, just update the following line:

model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)

balisujohn

Oct 21, 2023

Is there a way to get this to work on CPU with under 25 GB of RAM? When I tried to set it to float16, I got the following error:

"RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'"

It seems that float16 is not supported for some operations on CPU, so a low-precision CPU mode might need some actual changes to the code.

I think this would be useful because many people don't have more than 32 GB of RAM or a graphics card with 20+ GB of VRAM.
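One possible workaround for the 'Half' error, offered as an untested sketch: fall back from float16 to bfloat16 whenever the target device is the CPU, since bfloat16 is also a 16-bit type. Whether every operation this model uses is implemented for bfloat16 on CPU in a given PyTorch version, and how fast it runs, is an assumption that has not been verified here. The helper names `cpu_safe_dtype_name` and `load_fuyu` are hypothetical; only `FuyuForCausalLM.from_pretrained` with `device_map` and `torch_dtype` comes from the thread.

```python
def cpu_safe_dtype_name(device: str, requested: str = "float16") -> str:
    """Return the dtype name to use, swapping float16 for bfloat16 on CPU."""
    if device == "cpu" and requested == "float16":
        return "bfloat16"
    return requested


def load_fuyu(pretrained_path: str, device: str = "cpu", requested: str = "float16"):
    """Hypothetical loader: picks a dtype for the device, then loads the model."""
    # Imported lazily so the dtype helper above can be used without these packages.
    import torch
    from transformers import FuyuForCausalLM

    dtype = getattr(torch, cpu_safe_dtype_name(device, requested))
    return FuyuForCausalLM.from_pretrained(
        pretrained_path, device_map=device, torch_dtype=dtype
    )
```

Calling `load_fuyu(pretrained_path, device="cpu")` would then request bfloat16 instead of float16, while `device="cuda:0"` keeps the requested dtype unchanged.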

Neman

Oct 22, 2023

model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)

Maybe debatable, but I have read that bfloat16 is recommended over float16 because of its structure (it keeps the same exponent range as float32, at the cost of mantissa precision). So: torch_dtype=torch.bfloat16
I tested both: the model loads faster in bf16 (6s vs 10s for f16), but inference is faster with f16 (1.22 vs 1.70 for bf16). I can't assess quality, but people say bf16 is better.

Colderthanice

Oct 23, 2023

@latent-variable
Thanks, it worked!

layoric

Oct 23, 2023

•

edited Oct 23, 2023

I have found that when using multiple GPUs, this model requires more than 32GB of VRAM. I have 3x A4000 16GB cards (a weird setup, I know), and only just managed to fit within their limit when using a custom device_map. I hit a few issues related to which layers were required to be on the same GPU, so if it helps others, the following layers need to be listed in the device_map on the same device:

device_map = {
    'language_model.model.embed_tokens.weight': 'cuda:0',
    'language_model.lm_head.weight': 'cuda:0',
    'language_model.model.final_layernorm.bias': 'cuda:0',
    'language_model.model.final_layernorm.weight': 'cuda:0',
    'vision_embed_tokens.bias': 'cuda:0',
    'vision_embed_tokens.weight': 'cuda:0',
    # rest of the layers shared among devices..
}
}
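A minimal sketch of how the "rest of the layers" part could be filled in: pin the entries listed above to the first device, then split the remaining decoder layers into contiguous blocks across all devices. The layer prefix `language_model.model.layers` and the layer count are assumptions that are not confirmed in this thread and should be checked against the real module names (for example via `model.named_parameters()`); `build_device_map` is a hypothetical helper, and the resulting map may still be missing entries the model needs.

```python
# Entries that, per the post above, must share one device.
PINNED = [
    "language_model.model.embed_tokens.weight",
    "language_model.lm_head.weight",
    "language_model.model.final_layernorm.bias",
    "language_model.model.final_layernorm.weight",
    "vision_embed_tokens.bias",
    "vision_embed_tokens.weight",
]


def build_device_map(num_layers, devices, layer_prefix="language_model.model.layers"):
    """Pin shared entries to devices[0]; spread layers in contiguous blocks.

    ASSUMPTION: layers are named f"{layer_prefix}.{i}"; verify before use.
    """
    device_map = {name: devices[0] for name in PINNED}
    per_device = -(-num_layers // len(devices))  # ceiling division
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = devices[i // per_device]
    return device_map


if __name__ == "__main__":
    # num_layers=6 is a toy value for illustration, not the model's real depth.
    for name, device in build_device_map(6, ["cuda:0", "cuda:1", "cuda:2"]).items():
        print(name, "->", device)
```

Contiguous blocks are used rather than round-robin so that activations cross a device boundary only once per block. Because the first device also holds the pinned entries, it may need fewer layers than the others in practice.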

With torch_dtype=torch.bfloat16, my usage was more like 40GB, though I might be doing something else wrong. E.g. my code is:

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id,torch_dtype=torch.bfloat16)
model = FuyuForCausalLM.from_pretrained(model_id, device_map=device_map) # torch_dtype=torch.bfloat16 should be here

~~Any insights would be appreciated~~. Above I used the torch_dtype=torch.bfloat16 on the wrong line 🤦‍♂️, never mind me.

Thanks for releasing such an interesting model! 👏

ArthurZ

Oct 25, 2023

Thanks for sharing your experiences!

ludeksvoboda

Nov 5, 2023

•

edited Nov 5, 2023

I have run it on two RTX 3060s, each with 12GB. It needs a custom device map and FP16, and it can only process images up to about 500x1000 resolution.
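If larger images have to be downscaled to stay under a limit like this, the target size can be computed while keeping the aspect ratio. This is a small sketch; reading "500x1000" as width x height is an assumption, the limit itself is only the approximate figure reported for this particular two-GPU setup, and `fit_within` is a hypothetical helper.

```python
def fit_within(width: int, height: int,
               max_width: int = 500, max_height: int = 1000) -> tuple[int, int]:
    """Scale (width, height) down, keeping aspect ratio, to fit inside the box.

    Images that already fit are returned unchanged (never upscaled).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    print(fit_within(1000, 2000))
    print(fit_within(400, 300))
```

The returned pair can then be passed to an image library's resize call, for example Pillow's `Image.resize((new_width, new_height))`, before handing the image to the processor.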
