
What does it take to self-host Bloom? How much money would that cost?

#161
by damc - opened

What does it take to self-host Bloom?

I would like to have an API endpoint where I can send a request with a prompt and get the output back. What would it take to accomplish that? How much money would that cost, presuming I generate, let's say, 10,000 tokens a day (about 40,000 characters)?
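Just to make it concrete, something along these lines is what I have in mind (only a sketch; the bigscience/bloom-560m checkpoint, FastAPI, and the /generate route here are placeholders, not what I'm actually running):

```python
# Hypothetical minimal endpoint: prompt in, generated text out.
# Uses a small BLOOM checkpoint so it runs on a single modest GPU (or CPU);
# hosting the full 176B model is the part I don't know how to do cheaply.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```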

damc changed discussion title from What does it take to self-host Bloom? to What does it take to self-host Bloom? How much money would that cost?

I saw somewhere that the full BLOOM model, with 176B parameters, will use around 350GB of GPU RAM (176B parameters × 2 bytes per parameter in bfloat16 ≈ 352GB just for the weights), so it'll cost a lot. Let's see.

If we use Vultr pricing as an example, and considering that we need around 350GB of GPU RAM to deploy the full BLOOM model, we can get one instance with 4x Nvidia A100 (320GB GPU RAM) that costs USD 10.417/hour (or USD 7,000/month), plus one instance with 1/2 Nvidia A100 (40GB GPU RAM) that costs USD 1.302/hour (or USD 875/month).

This setup will have around 360GB of GPU RAM, so that should be enough to start. Summing it up: USD 7,000 + USD 875 = USD 7,875 per month, or USD 10.417 + USD 1.302 = USD 11.719 per hour, to run this model self-hosted with a reasonable response time.
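For reference, the back-of-envelope math behind those numbers in Python (assumptions: bf16 weights at 2 bytes per parameter, and the example Vultr prices quoted above; your provider's rates will differ):

```python
# Rough sizing/cost estimate for self-hosting the full BLOOM (assumptions:
# 176B parameters stored as bf16, example Vultr prices quoted above).
params = 176e9
weights_gb = params * 2 / 1e9        # 2 bytes/param in bf16 -> ~352 GB of weights

hourly_usd = 10.417 + 1.302          # 4x A100 (320GB) + 1/2 A100 (40GB)
monthly_usd = 7000 + 875             # the listed monthly prices

print(f"~{weights_gb:.0f} GB of weights")
print(f"~{hourly_usd:.3f} USD/hour, ~{monthly_usd} USD/month")
```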

Of course, this presumes that you'll use the full model; if you use fewer parameters (e.g. bloom-3b) you can greatly decrease your resource needs, as in this article, which deploys the 3B model on Amazon SageMaker. They're using the ml.g4dn.xlarge instance, which costs USD 0.7364 per hour to run (around USD 530 per month).

You'll need to check whether the full BLOOM model is really needed; trying the smaller BLOOM models first (see the sketch below) is the way to get a cost-effective inference API.
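If it helps, a quick sanity check with a smaller checkpoint looks roughly like this (a sketch using bigscience/bloom-3b in fp16, which needs on the order of 6-8 GB of GPU RAM; device_map="auto" requires the accelerate package):

```python
# Sketch: sanity-check the output quality of a smaller BLOOM before paying
# for the full 176B model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,
    device_map="auto",  # needs the `accelerate` package; omit to run on CPU
)

prompt = "Translate to French: I would like to self-host a language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```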

Thanks for your great answer.

I've read in another thread that there are Tesla M40s (24GB), which are older but do have a whopping 24GB each. Since they are reasonably priced at $150-180 "open package", buying about 15 of them (15 × 24GB = 360GB) should get you to 350GB.

However, the question remains:
Would that work at a reasonable speed (say, no slower than 1 second per token)? Could multiple GPUs work together if hooked up to the same system (think of something like a mining rig)?
If yes, I could see people building this. I could even see myself building this :)
But I think almost nobody has $3k to just "play around and find out", so a definitive answer would be great.

I see so much optimization going on for transformer-based models like Stable Diffusion and LLMs like BLOOM.

But so far I couldn't deploy it on Azure: there are no compute instances available with more than 12GB of RAM, and I'd have to manually request a higher quota, which doesn't work with my free trial account (which comes with USD 200 of credit).

BigScience Workshop org

@damc @Fusseldieb

Check out Petals: https://github.com/bigscience-workshop/petals
It gives an inference speed of ~1 sec/token for BLOOM-176B.

There is also an HTTP/WebSocket API endpoint: https://github.com/borzunov/chat.petals.ml
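Basic usage looks roughly like this (a sketch from memory of the Petals README at the time, so treat the class and model names as approximate and check the repo above for the current API):

```python
# Sketch of Petals usage: your machine runs the embeddings and sampling, while
# the BLOOM transformer blocks are served by other peers over the network.
# Class/model names (DistributedBloomForCausalLM, bigscience/bloom-petals) may
# differ in newer Petals releases.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

model_name = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = DistributedBloomForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```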

Thank you for sharing, looks interesting.
