GPU requirements

#68
by GuySerk - opened

Hi guys,

What is the minimum GPU requirement to run the falcon-40b model?
And which GPU is best suited for this model?


80+ GB of VRAM is required for the 16-bit model.

Try 48 GB with 8-bit quantization.
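
For anyone who wants to try the 8-bit route, here is a minimal sketch using transformers + accelerate + bitsandbytes. The exact argument names (e.g. `load_in_8bit`) have shifted between library versions, so treat this as a starting point rather than a tested recipe:

```python
# Rough sketch: load falcon-40b with int8 weights so it fits in ~48 GB of VRAM.
# Requires `pip install transformers accelerate bitsandbytes`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # let accelerate place layers on the available GPU(s)
    load_in_8bit=True,       # bitsandbytes int8 weights, roughly half the fp16 footprint
    trust_remote_code=True,  # Falcon shipped custom modelling code at release
)

inputs = tokenizer("Falcon 40B is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```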

You can run it on a single 3090 at 17 tokens/sec (up to 25 on a 4090) when using the Q2_K variant with ggllm.cpp.
Output quality is great; it probably loses a bit of precision, but it writes flawless poems and summaries in multiple languages.
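
Roughly, driving it from Python looks like the sketch below. The `falcon_main` binary name and the llama.cpp-style flags are assumptions based on the project's lineage, and the model path is a placeholder, so check the ggllm.cpp README for the exact names:

```python
# Sketch: run a Q2_K-quantized Falcon-40B file through a built ggllm.cpp binary
# and capture the generated text. Paths and flag names are placeholders/assumptions.
import subprocess

result = subprocess.run(
    [
        "./falcon_main",                      # assumed binary name from the ggllm.cpp build
        "-m", "models/falcon-40b-q2_k.bin",   # placeholder path to the Q2_K quantized weights
        "-p", "Write a short poem about GPUs.",
        "-ngl", "100",                        # offload all layers to the GPU (llama.cpp-style flag)
        "-n", "256",                          # number of tokens to generate
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```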

Could I run the 16-bit model on this? 8 GB Memory / 4 Intel vCPUs / 160 GB Disk -
Ubuntu 22.10 x64

No.
This could work: renting a Skylake SSD 3XL9 100 for $79 USD per month.

No, clearly not, and the affiliate link posted above is not suitable either.

  1. If you want to run the 16-bit version you'll need around 85-90 GB of RAM/VRAM. Your server has 8 GB, and the server in the affiliate link has 64 GB.
  2. Even if you used a server like that with enough RAM, your speed would be roughly 1 token every 15-30 seconds. Literally not usable for anything except as a room heater.
  3. The only reasons I can think of for running a 16-bit version of Falcon on a CPU are academic, and it's hard to find a good reason to run 16-bit inference on a GPU either.
     There are no quality benefits over a high-quality quantized version, the RAM requirements are extreme, and the processing is slow.

On your server you cannot expect to run Falcon 40B: the smallest 40B variant using the cmp-nct repository is around 13 GB with processing buffers included, and that is at 2.5-bit quantization.
It's about 15 GB at 3.5-bit and almost 24 GB at 4.5-bit.
However, you can run Falcon 7B on that machine: at 5-bit quantization you get about the same quality as 16-bit and need roughly 7 GB of RAM for processing it, at a speed of a couple of tokens per second.
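
To make those numbers concrete, here is the back-of-envelope arithmetic behind them (a rough estimate only: K-quants mix bit widths and the files carry metadata, so real sizes differ by a few GB, and KV cache / scratch buffers come on top):

```python
# Rough memory estimates for Falcon weights at different precisions.
# These cover the weights alone; processing buffers explain the higher totals quoted above.

def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

falcon_40b = 40e9
falcon_7b = 7e9

print(f"40B @ 16  bit: ~{weight_gigabytes(falcon_40b, 16):.0f} GB")   # ~80 GB, hence 85-90 GB with buffers
print(f"40B @ 8   bit: ~{weight_gigabytes(falcon_40b, 8):.0f} GB")    # ~40 GB, fits in 48 GB of VRAM
print(f"40B @ 2.5 bit: ~{weight_gigabytes(falcon_40b, 2.5):.1f} GB")  # ~12.5 GB, close to the ~13 GB quoted
print(f"7B  @ 5   bit: ~{weight_gigabytes(falcon_7b, 5):.1f} GB")     # ~4.4 GB weights, ~7 GB with buffers
```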

Quick question for the community

I have an opportunity to pick up four second-hand 24 GB 3090s at a reasonable discount.

The only reason I would do this is if I could run Falcon-40B on them. I understand from the above that performance on a single 3090 would be around 17 tokens/sec. Would a 4x configuration be viable for running the 40B-parameter model? Would there be a faster configuration for running Falcon-40B in the ~$3k price range? As a bonus question, can anyone speculate on the kind of performance I might realistically expect? This is a training system for me, but I'd also like to actually use it for real-world problems if possible.

Thank you for any assistance and advice. I don't want to spend the money if this is not going to work.


If you run it using ggllm (https://github.com/cmp-nct/ggllm.cpp), a single 3090 can run 3-bit, and on Windows also 4-bit with a bit of squeezing.
Performance on a single 3090 is probably around 15 tokens/sec; on two 3090s (for the larger variants) I'm not sure about performance, but it's likely about the same.
More than two would never be needed as long as no larger Falcon is released.
The upcoming release will have special support for 4090s that uses their better compute capability, which is mostly relevant for prompt-processing speed.

On the Python side I don't know; I'd assume you can run it on 4x 3090 using PyTorch or 2x 3090 using GPTQ.
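
For the PyTorch route, a minimal sketch of sharding across four 24 GB cards with transformers + accelerate is below. The per-GPU memory caps are illustrative, and fp16-class weights (~80 GB) plus cache make this a tight fit, so treat it as an untested starting point:

```python
# Sketch: shard Falcon-40B in bfloat16 across four 24 GB GPUs using accelerate's device_map.
# Memory caps are illustrative; leave headroom on each card for activations and KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",                           # accelerate splits layers across all visible GPUs
    max_memory={i: "22GiB" for i in range(4)},   # per-card cap on a 4x 3090 box (assumed values)
    trust_remote_code=True,
)

prompt = "Explain in one paragraph why quantization reduces memory usage."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=100)[0], skip_special_tokens=True))
```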
