Llama 3 inference scaling

#91
by metalfusion10

Can someone give me a rough estimate of whether I can run inference on a Llama model from a locally developed chat application, using the transformers library, scaled across 50 people with two NVIDIA RTX 5000 Ada 36 GB VRAM GPUs?
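For context, this is roughly the setup I had in mind (a minimal sketch; the model ID, dtype, and prompt are just assumptions on my part):

```python
# Minimal sketch: load a Llama model sharded across both GPUs with transformers.
# Assumes the accelerate package is installed; the model ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # hypothetical choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the weights fit in VRAM
    device_map="auto",           # let accelerate split layers across both GPUs
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```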

I'm trying to figure out how fast each user would get their response back after inference. Please let me know.
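From what I understand, a plain `generate()` call handles one request at a time, so serving 50 users would mean batching their prompts together. Continuing the sketch above, something like this (the batch size and prompts are placeholders):

```python
# Minimal sketch: batch several users' prompts into one generate() call.
# Left padding is the usual setup for batched decoder-only generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [f"User {i}: summarize my notes" for i in range(8)]  # hypothetical batch
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

out = model.generate(**batch, max_new_tokens=128)
replies = tokenizer.batch_decode(out, skip_special_tokens=True)
```

I realize per-user latency would then depend on model size, batch size, and output length, and that dedicated serving stacks like vLLM or text-generation-inference do continuous batching for exactly this many-concurrent-users case.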

Can it run on two RTX 4090s, pooling the memory like 24 + 24 = 48 GB?
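As far as I can tell, `device_map="auto"` does treat the two cards as one pool for the weights (so 24 + 24 behaves like 48 GB for fitting the model, though it doesn't double the compute speed). A sketch with explicit per-GPU caps, reusing the imports from the first snippet; the limits below are assumptions:

```python
# Minimal sketch: cap how much of each 24 GB RTX 4090 the weights may use,
# leaving headroom for the KV cache and activations (limits are assumptions).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},
)
```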
