Necessary hardware for Llama 2

#27
by Samitoo - opened

Hello!

I just wanted to know: is the GPTQ model the one needed to run the model on a GPU? If so, what type of graphics card do you need to run the 7B, 13B or 70B models? And is it possible to run the model on several small graphics cards?

Thank you for your answers!

Yes, GPTQ is for running on GPU. Actually, GGML can run on GPU as well. But GPTQ can offer maximum performance.

The GPU requirements depend on how GPTQ inference is done. If you use ExLlama, which is the most performant and efficient GPTQ library at the moment, then:

  • 7B requires a 6GB card
  • 13B requires a 10GB card
  • 30B/33B requires a 24GB card, or 2 x 12GB
  • 65B/70B requires a 48GB card, or 2 x 24GB

Yes, you can split inference across multiple smaller GPUs; e.g. 2 x 24GB is a common way to run 65B and 70B models. It is slower than using one card, but it does work. For example, using ExLlama with 2 x 4090 24GB GPUs can give 18-20 tokens/s with 65B and 14-17 tokens/s with 70B. A single 48GB card like an A6000 would likely do 20+ tokens/s.
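
For reference, here is a minimal sketch of loading a GPTQ checkpoint so that it shards across whatever GPUs are visible. It uses the Transformers/AutoGPTQ route with `device_map="auto"` rather than ExLlama directly, so treat it as an illustration of multi-GPU splitting, not the exact setup behind the numbers above:

```python
# Minimal sketch: load a GPTQ model and let Accelerate shard it across
# all visible GPUs. Assumes transformers, optimum and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"  # the 13B chat GPTQ repo discussed here

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spreads layers across every available GPU
)

prompt = "Explain what GPTQ quantization does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```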

Thank you very much for your answer!

So if I understand correctly, to use the TheBloke/Llama-2-13B-chat-GPTQ model, I would need 10GB of VRAM on my graphics card. But is there a way to load the model onto an 8GB graphics card, for example, and load the rest (2GB) into the computer's RAM?

In addition, how many simultaneous requests with a 4096-token input can this model handle on a 24GB 3090? I know the model takes 10GB on the card, plus about 3GB for runtime kernels, and some memory per query. But if the card is loaded to 14GB, there are 10GB left; if a request requires 1GB of space, does that mean I could handle 10 requests simultaneously?

Thank you very much for your answers and your work!

Not possible with GPTQ; GPTQ only supports splitting between GPUs.
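
If you do need to spill part of the model into system RAM, the GGML/llama.cpp route mentioned earlier supports partial offload. A rough sketch with llama-cpp-python, where the model file name and layer count are placeholders rather than tested values:

```python
# Hedged sketch: keep some layers on the GPU, the rest in system RAM.
# Assumes llama-cpp-python is installed with GPU support; the file path
# below is hypothetical and must point at a real local GGML/GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=30,  # offload as many layers as fit in VRAM; the rest stay in RAM
    n_ctx=4096,
)

print(llm("Q: What is GPTQ?\nA:", max_tokens=64)["choices"][0]["text"])
```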

I have 2 GPUs (each 8 GB) and I want to use 13B. Please guide me on how I can use 2 GPUs.

@saifhassan have you figured out a way?

Does ExLlamaV2 allow concurrent request processing?

@abpani1994 Yes, but for more than about 10 users you should probably use a batch inference library like vLLM, Aphrodite, TGI, or SGLang. Otherwise ExLlamaV2 should be fine for a few users.
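
For illustration, a minimal vLLM sketch that batches several prompts in one call; the model name, GPTQ quantization flag, and sampling settings here are just examples, so check the vLLM docs for what your version supports:

```python
# Rough sketch of batched inference with vLLM, one of the libraries
# mentioned above for serving many concurrent users.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-13B-chat-GPTQ", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize GPTQ in one sentence.",
    "What does ExLlama do?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```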
