RAM and VRAM utilization

#1 opened by chenxiangyi10

Thank you for providing this great model.

The table in the model card gives the max RAM required. If I offload the model to the GPU, will it require the same amount of VRAM?

And does "max RAM required" mean the RAM required to use the full context length (2048)?

You're welcome.

If you offload to GPU, yes, it will need roughly as much VRAM as you see in the RAM table. In other words, if it says it needs 18 GB of RAM and you offload all layers to VRAM, it will need somewhere around 18 GB of VRAM. Actually, probably a bit more than that, especially if you're using the latest llama.cpp code, which now has full GPU acceleration.

If you do offload to VRAM, you won't need as much RAM. That's why it says "max RAM required": it's the amount of RAM needed if you don't offload to the GPU at all. If you fully offloaded to the GPU, it would only need about 3 GB of RAM. If you offloaded half the layers, it would need roughly half the RAM, with the other half in VRAM. And so on.
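To make that proportional split concrete, here's a rough back-of-the-envelope sketch in Python. The 18 GB figure and the ~3 GB of RAM that stays resident even with a full offload come from the example above; the 40-layer count is just an illustrative assumption, and the real numbers depend on the specific model and quantisation.

```python
def estimate_memory_split(max_ram_gb: float, total_layers: int,
                          gpu_layers: int, min_ram_gb: float = 3.0):
    """Rule-of-thumb split of model memory between system RAM and VRAM.

    Assumes memory scales roughly linearly with the number of layers
    offloaded, and that a few GB of RAM are always needed even when
    everything is on the GPU. Illustrative only, not exact llama.cpp figures.
    """
    fraction_on_gpu = gpu_layers / total_layers
    vram_gb = max_ram_gb * fraction_on_gpu
    ram_gb = max(max_ram_gb * (1 - fraction_on_gpu), min_ram_gb)
    return ram_gb, vram_gb

# Hypothetical 18 GB model with 40 layers
for n in (0, 20, 40):
    ram, vram = estimate_memory_split(18.0, 40, n)
    print(f"{n:>2} layers on GPU -> ~{ram:.0f} GB RAM, ~{vram:.0f} GB VRAM")
```

Running that prints roughly 18/0, 9/9 and 3/18 GB for no, half and full offload, which is the pattern described above.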

With llama.cpp/GGML, RAM usage doesn't really change as the context gets longer. So these RAM figures apply at 2048 context just as they do at 500 context. It grows a little at the start, but it's not like GPTQ, where VRAM usage keeps growing as the context increases.
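If you're driving llama.cpp through the llama-cpp-python bindings, both the offload and the context size are set when the model is loaded. This is only a minimal sketch: the model path is a placeholder, and 40 is the assumed layer count from the example above, so set n_gpu_layers to however many layers actually fit in your VRAM.

```python
from llama_cpp import Llama

# Hypothetical GGML model file; replace with your own path.
llm = Llama(
    model_path="models/example-model.ggmlv3.q4_0.bin",
    n_ctx=2048,       # full context; RAM use is roughly the same as at 500
    n_gpu_layers=40,  # offload all 40 layers (use a smaller number for a
                      # partial offload, or 0 to keep everything in system RAM)
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])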

Thank you so much. It is very interesting that "it is not like GPTQ where VRAM usage keeps growing as context increases."
