What is the right GPU to run this?

#7
by Varunk29 - opened

I tried using 4 × 24 GB GPUs and inference was very slow. Can you suggest the right GPU to run it on for fast inference?

I'm having success running it on a 80GB A100, generating about 22 tokens/s (with up to around 10 concurrent requests). Seems to be working after bumping the latest vLLM and TGI versions.

P.S. For GPU access I'm using Modal (disclaimer: I work at Modal) - there are a couple of examples (TGI, vLLM) there for how to run this quickly.
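
For reference, this is roughly how I'm invoking it through vLLM's Python API (a minimal sketch; the model ID and prompt below are placeholders, swap in the actual repo name):

```python
# Minimal vLLM sketch, assuming a recent vLLM release and a single 80 GB A100.
# "org/model-name" is a placeholder, not the real repository ID.
from vllm import LLM, SamplingParams

llm = LLM(model="org/model-name", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Write a short summary of attention."], params)
print(outputs[0].outputs[0].text)
```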

Thanks for the suggestion, quite helpful :)

@Varunk29 Are the generate() and encode() functions from the tokenizer and model thread-safe?

I also want to run concurrent inferences (from multiple threads on the same model object), but I'm not sure if they are thread-safe.
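
This is a sketch of the pattern I have in mind, with a lock as a fallback in case generate()/encode() are not thread-safe (model name and prompts are placeholders):

```python
# Multiple threads sharing one model object; calls are serialized with a lock
# as a precaution until thread safety is confirmed.
import threading
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model-name")  # placeholder ID
model = AutoModelForCausalLM.from_pretrained("org/model-name", device_map="auto")
lock = threading.Lock()

def infer(prompt):
    with lock:  # drop the lock only if generate()/encode() are known to be thread-safe
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

threads = [threading.Thread(target=infer, args=(p,)) for p in ["prompt 1", "prompt 2"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```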

I tried running on 4 × V100 (32 GB) and inference was very slow; one inference takes 6 minutes with an input token length of 1700.
