What is the required amount of VRAM for running it?

#5
by boohwooh - opened

I will run this on runpod.io for a test. The base model (Llama-2 70B) is 120 GB, but this one is more than 300 GB. How much VRAM is required to run it?
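For a rough sense of scale: weight memory is roughly parameter count × bytes per parameter, so a 70B model needs about 280 GB in fp32, 140 GB in fp16, and 35 GB at 4 bits, before KV cache and runtime overhead. A minimal sketch of that arithmetic (the 70e9 parameter count is assumed from the "70b" in the model name):

```python
# Back-of-envelope VRAM estimate for model weights only
# (ignores KV cache, activations, and CUDA context overhead).
params = 70e9  # assumed from "70b" in the model name

for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB")
# fp32: ~280 GB, fp16: ~140 GB, 4-bit: ~35 GB
```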

I'm running it quantized to 4 bits (with a typical bitsandbytes load_in_4bit load) and it's using 46.511 GB.
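For reference, a minimal sketch of that kind of 4-bit load with transformers and bitsandbytes (the model id is a placeholder; `device_map="auto"` assumes accelerate is installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder: substitute the actual repo id

# 4-bit quantized load via bitsandbytes; matmuls computed in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard across available GPUs; requires accelerate
)
```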

Thank you. I was curious about the base model's VRAM requirements.

boohwooh changed discussion status to closed