# bigscience/bloom

## Hosting bloom 176B model for inference #45

by NonStatic - opened

Hi guys, I'm very interested in running the 176B model locally. May I know how many A100-40GB GPUs are needed to host it? Thanks!

As a first order estimate, 176B parameters in half precision (16 bits = 2 bytes) would need 352 GB RAM. But since some modules are 32-bit, it would be more. So about nine GPUs with 40-GB RAM, and it doesn't take into account the input.
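The estimate above can be reproduced in a few lines. This is only a sketch of the weights-only arithmetic: the function names are made up for illustration, and it ignores the 32-bit modules, the activations and the input mentioned above, so treat the result as a lower bound.

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def gpus_needed(total_gb: float, gpu_gb: float) -> int:
    """Smallest number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gpu_gb)


bloom_fp16_gb = weight_memory_gb(176e9)   # 176B parameters at 2 bytes each
print(bloom_fp16_gb)                      # 352.0
print(gpus_needed(bloom_fp16_gb, 40))     # 9 (A100-40GB, weights only)
```

Swapping in a different `gpu_gb` gives the same lower bound for other cards.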

However, you can offload some parameters to CPU RAM, which means it will be slower, but at least you can run it.
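One way to try the offloading idea is the `device_map="auto"` option of `from_pretrained` in transformers, which needs the `accelerate` package installed. The sketch below is an assumption about how it could be set up, not something posted in this thread: the helper names, the memory budgets and the model path are all placeholders.

```python
def build_max_memory(n_gpus: int, gpu_gib: int, cpu_gib: int) -> dict:
    """Per-device memory budget: GPU indices as int keys, plus a 'cpu' entry."""
    budget = {i: f"{gpu_gib}GiB" for i in range(n_gpus)}
    budget["cpu"] = f"{cpu_gib}GiB"
    return budget


def load_with_offload(model_path: str, max_memory: dict):
    """Fill the GPUs first and place whatever does not fit in CPU RAM."""
    # Imported lazily so the budget helper can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",          # let accelerate decide the layer placement
        max_memory=max_memory,      # hypothetical budget, tune for your machine
        torch_dtype=torch.bfloat16,
    )


# Hypothetical budget: 4 GPUs with some headroom, the rest in CPU RAM.
print(build_max_memory(n_gpus=4, gpu_gib=38, cpu_gib=400))
```

Layers that end up on the CPU run much slower, so this trades speed for being able to load the model at all.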

You can try a quantized version. Here's an 8-bit version I created using LoRA: https://huggingface.co/joaoalvarenga/bloom-8bit

I have successfully loaded it on a single x2iezn.6xlarge instance in AWS, but using only CPUs the model is very slow. Text generation sampling for several sequences can take several minutes to return, but the full model is working and it is much cheaper for local evaluation than 9 GPUs!

As a first order estimate, 176B parameters in half precision (16 bits = 2 bytes) would need 352 GB RAM. But since some modules are 32-bit, it would be more. So about nine GPUs with 40-GB RAM, and it doesn't take into account the input.

That's a huge cost, right? I had a look at the AWS instance types with GPUs, and in the G5 instance family not even the biggest instance has ~400 GB of GPU memory.
Checking the cost of the biggest G5 instance with the savings plan calculator currently gives me $12.8319 per hour. So hosting that for one year costs 12.8319 * 24 * 365 = $112,407.44.

That's a huge cost for hosting the inference endpoint, and I'm still only at 192 GB of GPU memory, so I'm not sure it would even work with only that instance. And I don't want to retrain, only generate text based on an input prompt.

Do I have an error in my calculation, or have I misunderstood something? Or is it really that expensive and difficult to host a full inference endpoint?
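The yearly figure can be double-checked with a tiny helper. It is just the arithmetic from the post, assuming the instance runs around the clock; the function name is made up.

```python
def yearly_cost(hourly_usd: float, hours_per_day: int = 24, days: int = 365) -> float:
    """Cost of keeping one instance up for a year at a fixed hourly rate."""
    return hourly_usd * hours_per_day * days


print(f"${yearly_cost(12.8319):,.2f}")  # $112,407.44
```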

You don't have any kind of error.

Look at the Tesla M40 24GB; it's cheap and supports the latest CUDA drivers.

For cloud hosting, a p4d.24xlarge will fit most of it in GPU memory (320 GB) and a p4de.24xlarge will fit all of it in GPU memory (640 GB), but you're looking at $32-$41/hr. Hosting these very large LMs for continuous use in a real-time application is very cost prohibitive at this time, unless you are selling a product built on them and can pass the hosting cost on to customers, or you already own a bunch of expensive hardware. It is much cheaper to use the HF API unless you have data use restrictions that require controlling the complete environment. In that case, as I said above, an x2iezn.6xlarge is the cheapest option I have found to run it on AWS ($5/hr), but it is too slow on CPUs for real-time applications. It would work for offline/batch operations, but throughput is very low as well, so we have to run several instances.
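To put "most of it" and "all of it" into numbers, here is a rough sketch comparing the GPU memory figures quoted in this thread against the 352 GB fp16 weight estimate. It only counts weights, so real headroom will be smaller; the labels and function name are illustrative.

```python
WEIGHTS_GB = 352  # fp16 weight estimate from earlier in the thread

# Total GPU memory per instance, as quoted in this thread.
INSTANCE_GPU_GB = {
    "biggest G5": 192,
    "p4d.24xlarge": 320,
    "p4de.24xlarge": 640,
}


def fraction_on_gpu(gpu_gb: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Share of the weights that fits in GPU memory, capped at 1.0."""
    return min(1.0, gpu_gb / weights_gb)


for name, gpu_gb in INSTANCE_GPU_GB.items():
    print(f"{name}: {fraction_on_gpu(gpu_gb):.0%} of the fp16 weights fit on GPU")
```

Anything below 100% means the remainder has to be offloaded to CPU RAM or disk.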

Or Is it really that expensive [...] to host a full inference endpoint

Yes, in general, this will be expensive. Those models are just very large 😱

Note however that as @maveriq says,

you can offload some parameters to CPU RAM, which means it will be slower, but at least you can run it.

@Eldy Do you use Tesla M40s?
With 24 GB each you would need around 20 of them.

What kind of hardware do you run them on? Rack servers or perhaps a mining type setup?

@IanBeaver
Have you got it running on any of the GPU instances? I'm curious what the inference times look like compared to CPU (a few minutes is painful for most use cases).

@Eldy Do you use Tesla M40s?
With 24 GB each you would need around 20 of them.

What kind of hardware do you run them on? Rack servers or perhaps a mining type setup?

I plan to set it up in rack servers, but I haven't decided yet.

For OPT-175B, 16 x M40s are enough: 2 instances (8 cards each) working in parallel with Alpa, see https://alpa-projects.github.io/tutorials/opt_serving.html

@IanBeaver
Have you got it running on any of the GPU instances? I'm curious what the inference times look like compared to CPU (a few minutes is painful for most use cases).

I have not tried yet, but you can see some GPU times posted in https://huggingface.co/bigscience/bloom/discussions/59. However, they seem very slow, so I am not sure if there is some issue with the environment or configuration.

I have an IBM x3950 X5 server sitting around in my basement with 512 GB of main memory, 80 CPU cores and an RTX Titan 24 GB in it. Do you think one of the models will run on that?

@IanBeaver
How long did it take to load the model? I tried to run BLOOM-176B on an x2iezn.6xlarge instance, but it seems to get stuck after loading most of the model (there is almost no CPU usage after RAM usage reaches ~670 GB). I waited for about 3 hours before giving up. Did you have this problem?

@viniciusguimaraes I did not have that problem, but I was also not using the API to download the model. I have had similar problems with the Hugging Face API downloader with other large models like T0pp, where it apparently finishes but the code never returns. So now I always prefetch large model files with git or curl and load them from a local path. Loading the model locally probably took 10 minutes or more as I recall, but I didn't time it, so I don't know the actual load time.

Out of curiosity I loaded it again on an x2iezn.6xlarge instance and timed it. Interestingly, it took 61 minutes to load, but it seemed to be mostly loaded within 10 minutes and spent the remaining 50 minutes cycling back and forth between 670 GB and 675 GB of RAM while using 100% of one CPU core. Perhaps a recent update in transformers broke something in the load function? It did complete the load from local disk though.

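For reference, this is the only code I executed:

```python
from transformers import AutoModelForCausalLM

# Load the full checkpoint from a local directory (files prefetched with git or curl).
model = AutoModelForCausalLM.from_pretrained("/data/bloom")
```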

When I run my script I see the same RAM behavior, but CPU usage drops to 0.7%. I've tried loading from disk and from the Hugging Face API but had no success with either.

The first time I tried to download the model I didn't have enough disk space and had to manually download the missing files after increasing the disk volume. I will try to download BLOOM again using only git-lfs and see if that solves it. Anyway, thank you for taking some time out of your day to respond.
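Since an interrupted download can leave shards missing, a quick sanity check before loading might help. This sketch assumes the local directory contains a sharded-checkpoint index file named `pytorch_model.bin.index.json` with a `weight_map` entry, as transformers writes for sharded PyTorch checkpoints; the file name and the helper are assumptions, so adjust them to what is actually on disk.

```python
import json
from pathlib import Path


def missing_shards(model_dir, index_name="pytorch_model.bin.index.json"):
    """Return the shard files listed in the index that are absent on disk."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / index_name).read_text())
    # weight_map maps each parameter name to the shard file that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (model_dir / name).is_file()]
```

For example, `missing_shards("/data/bloom")` should return an empty list when every shard is present. It does not detect truncated files, so comparing file sizes or checksums would be a sensible next step.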

@Tallin I'm late to the party. I've just got BLOOM running on my local server (following https://huggingface.co/bigscience/bloom/discussions/87), and I observed that on the CPU (without a GPU), given a decent number of cores, the calculations run quite fast. The cloud VMs that come with enough memory to hold the whole model have at least 48 cores; I wonder if PyTorch will use them effectively.
A side note (see the cited discussion): not all CPUs run multithreaded out of the box, at least when you just 'walk up and try to use' them.
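One way to see whether PyTorch is actually using the available cores is to compare the cores the process may run on with PyTorch's intra-op thread count. This is a hedged sketch rather than the method from the cited discussion; the helper names are made up, and `torch` is imported lazily so the core count works without it.

```python
import os


def available_cores() -> int:
    """Cores this process is allowed to run on (falls back to the total count)."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def torch_thread_count():
    """PyTorch's intra-op thread count, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.get_num_threads()


print("cores available:", available_cores())
print("torch threads:  ", torch_thread_count())
```

If the thread count is much lower than the core count, calling `torch.set_num_threads(available_cores())` before running inference is one thing to try.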