How do you estimate the number of GPUs required to run this model?

#29 opened by vishjoshi

The organisation I work with has an HPC setup where I can request a number of NVIDIA A100/V100 GPUs to run inference on this model.

How many GPUs should I ask for?

I tried running with 1 × NVIDIA A100 GPU vs 2 × NVIDIA A100 GPUs, but I don't see much improvement in tokens per second or load time with more GPUs.
Both runs were done with CUDA support.

The easiest way to estimate how much VRAM the model requires is to take the model file size and add 1 or 2 GB of overhead. If the file is 20 GB, add 2 GB and you get roughly 22 GB of VRAM.
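If you want to script that rule of thumb, here is a minimal sketch (the file path and the 2 GB overhead figure are just placeholder assumptions, not values from this thread):

```python
import os

def estimate_vram_gb(model_path: str, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a GGUF file: file size plus a small overhead
    for the KV cache and CUDA buffers."""
    file_size_gb = os.path.getsize(model_path) / 1024**3
    return file_size_gb + overhead_gb

# Example with a placeholder path:
print(f"~{estimate_vram_gb('model-q6_k.gguf'):.1f} GB VRAM needed")
```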

I would recommend using the Q6 quant, since it has almost no quality loss while still running at a decent speed. It takes roughly 40 GB of VRAM, which fits comfortably on an 80 GB A100.
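On the multi-GPU question: if you are running the GGUF through llama-cpp-python, a second GPU mainly gives you extra VRAM headroom rather than faster single-request decoding, and the layers have to be explicitly split for the second card to be used at all. A hedged sketch (the model path, split ratio, and context size below are assumptions for illustration):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q6_k.gguf",  # placeholder path to the Q6 quant
    n_gpu_layers=-1,               # offload all layers to GPU
    tensor_split=[0.5, 0.5],       # put half the layers on each A100
    n_ctx=4096,                    # context window; adjust to your use case
)

out = llm("Explain what this model does in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```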
