What does it take to self-host Bloom? How much money would that cost?

#161
by damc - opened

What does it take to self-host Bloom?

I would like to have an API endpoint where I can send the request with a prompt and get the output. What would it take to accomplish that? How much money would that cost presuming that I would generate let's say 10 000 tokens a day (about 40 000 characters)?
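For concreteness, something like this minimal sketch is what I have in mind (FastAPI + transformers, with a small BLOOM checkpoint standing in for the full model; the names and settings here are just my assumptions, not a recommended setup):

```python
# Minimal sketch of a prompt-in, completion-out endpoint.
# bigscience/bloom-560m is used only so this runs on modest hardware;
# the full model would need ~350GB of GPU RAM and a sharded setup.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "bigscience/bloom-560m"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```

So basically: run it with `uvicorn server:app`, then POST `{"prompt": "...", "max_new_tokens": 32}` to `/generate`.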


I saw somewhere that the full Bloom model, with 176B parameters, needs around 350GB of GPU RAM, so it'll cost a lot. Let's see.

If we use Vultr's pricing as an example, and considering that we need around 350GB of GPU RAM to deploy the full Bloom model, we can get one instance of their 4x Nvidia A100 (320GB GPU RAM) at USD 10.417/hour (or USD 7,000/month), plus one instance of 1/2 Nvidia A100 (40GB GPU RAM) at USD 1.302/hour (or USD 875/month).

This setup will have around 360GB of GPU RAM, so that should be enough to start. Summing up: USD 7,000 + USD 875 = USD 7,875 per month, or USD 10.417 + USD 1.302 = USD 11.719 per hour, to run this model self-hosted with a reasonable response time.
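As a rough sanity check, and to tie it back to the 10,000 tokens/day in the question (the ~300,000 tokens/month and the per-token figure are my own back-of-the-envelope assumptions, not Vultr's numbers):

```python
# Rough cost arithmetic for the Vultr setup quoted above (assumed prices).
hourly = 10.417 + 1.302          # 4x A100 + 1/2 A100, USD per hour
monthly = 7000 + 875             # Vultr's capped monthly prices, USD

tokens_per_month = 10_000 * 30   # the asker's 10,000 tokens/day
cost_per_token = monthly / tokens_per_month

print(f"{hourly:.3f} USD/hour, {monthly} USD/month")
print(f"~{cost_per_token:.3f} USD per generated token at that volume")
```

At that volume it works out to roughly USD 0.026 per generated token, which is why the smaller models below are worth considering.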

Of course this presumes that you'll use the full model; if you use a model with fewer parameters (e.g. bloom-3b) you can greatly reduce your resource needs, as in this article, which deploys the 3B model on Amazon SageMaker. They're using the ml.g4dn.xlarge instance, which costs USD 0.7364 per hour (around USD 530 per month).
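If you want to try a smaller variant before paying for anything, a sketch like this is enough to see whether bloom-3b quality is acceptable for your use case (plain transformers; it assumes a single GPU with enough memory, and the prompt and settings are just placeholders):

```python
# Quick local test of a smaller BLOOM variant (bigscience/bloom-3b).
# Assumes a CUDA GPU with enough memory for the model in fp16.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "The main cost drivers when self-hosting a large language model are"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```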

You'll need to check whether the full Bloom model is really needed; trying the smaller Bloom models first is the way to get a cost-effective inference API.

Thanks for your great answer.

I've read in another thread that there are Tesla M40s, which are older, but do have a whopping 24GB each. Since they are reasonably priced at $150-180 "open package", around 15 of them (roughly $2,250-2,700) would get you to ~360GB.

However, the question remains:
Would that work at a reasonable speed (say, around 1 second per token or better)? Could multiple GPUs work together if hooked up to the same system (think of something like a mining rig)?
If yes, I could see people building this. I could even see myself building this :)
But I think almost nobody has $3k to just "play around and find out", so a definitive answer would be great.
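For what it's worth, transformers/accelerate can at least shard one checkpoint across several cards in the same box via device_map="auto"; whether that's fast enough on old M40s is exactly the open question. A sketch, with a smaller checkpoint standing in for the full model and all settings being my own assumptions:

```python
# Sketch of splitting one BLOOM checkpoint across several GPUs in one machine.
# Requires `accelerate` installed; device_map="auto" places layers on the
# available GPUs (and CPU, if needed) automatically. Throughput on old cards
# like the M40 is untested here.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-7b1"   # stand-in; 176B would need ~15 x 24GB cards
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                # shard layers across all visible GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello, my rig of old GPUs", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```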

I see so much optimization going on for transformer-based models like Stable Diffusion and LLMs like Bloom.

But so far I couldn't deploy it on Azure. No compute instances with more than 12GB of RAM are available unless I manually request a higher quota, which doesn't work with my free trial account (which only has USD 200 of credit).

BigScience Workshop org

@damc @Fusseldieb

Check out Petals: https://github.com/bigscience-workshop/petals It gives an inference speed of ~1 sec/token for BLOOM-176B.
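A minimal usage sketch of the Python library (the class and checkpoint names follow the Petals README of that time and may have changed, so treat them as assumptions and check the repo):

```python
# Sketch: generating with BLOOM-176B over the public Petals swarm.
# Names below are taken from the Petals README; verify against the repo.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

model_name = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = DistributedBloomForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat in French is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```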

There is also an HTTP/WebSocket API endpoint: https://github.com/borzunov/chat.petals.ml

Thank you for sharing, looks interesting.
