Necessary hardware for Llama 2

#27
by Samitoo - opened

Hello!

I just wanted to know: is the GPTQ model the one needed to run the model on a GPU? If so, what type of graphics card do you need to run the 7B, 13B or 70B models? And is it possible to run the model on several small graphics cards?

Thank you for your answers!

Yes, GPTQ is for running on GPU. Actually, GGML can run on GPU as well. But GPTQ can offer maximum performance.

The GPU requirements depend on how GPTQ inference is done. If you use ExLlama, which is the most performant and efficient GPTQ library at the moment, then:

  • 7B requires a 6GB card
  • 13B requires a 10GB card
  • 30B/33B requires a 24GB card, or 2 x 12GB
  • 65B/70B requires a 48GB card, or 2 x 24GB

Yes, you can split inference across multiple smaller GPUs; e.g. 2 x 24GB is a common way to run 65B and 70B models. It is slower than using one card, but it does work. For example, using ExLlama with 2 x 4090 24GB GPUs can give 18-20 tokens/s with 65B and 14-17 tokens/s with 70B. A single 48GB card like an A6000 would likely do 20+ tokens/s.
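
For reference, here is a minimal sketch of loading a GPTQ checkpoint so that it shards across whatever GPUs are visible. It uses the Transformers/AutoGPTQ route with `device_map="auto"` rather than ExLlama directly, so treat it as an illustration of multi-GPU splitting, not the exact setup behind the numbers above:

```python
# Minimal sketch: load a GPTQ model and let Accelerate shard it across
# all visible GPUs. Assumes transformers, optimum and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"  # the 13B chat GPTQ repo discussed here

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spreads layers across every available GPU
)

prompt = "Explain what GPTQ quantization does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```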

Thank you very much for your answer!

So if I understand correctly, to use the TheBloke/Llama-2-13B-chat-GPTQ model, I would need 10GB of VRAM on my graphics card. But is there a way to load the model onto an 8GB graphics card, for example, and load the rest (2GB) into the computer's RAM?

In addition, how many simultaneous requests with a 4096-token input can this model handle on a 24GB 3090? I know the model takes 10GB on the card, plus about 3GB for runtime kernels, and some memory per query. But if the card is loaded to 14GB, there are 10GB left; if a request requires 1GB of space, does that mean I could handle 10 requests simultaneously?

Thank you very much for your answers and your work!

Not possible with GPTQ; GPTQ only supports splitting between GPUs.
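
If you do need to spill part of the model into system RAM, the GGML/llama.cpp route mentioned earlier supports partial offload. A rough sketch with llama-cpp-python, where the model file name and layer count are placeholders rather than tested values:

```python
# Hedged sketch: keep some layers on the GPU, the rest in system RAM.
# Assumes llama-cpp-python is installed with GPU support; the file path
# below is hypothetical and must point at a real local GGML/GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=30,  # offload as many layers as fit in VRAM; the rest stay in RAM
    n_ctx=4096,
)

print(llm("Q: What is GPTQ?\nA:", max_tokens=64)["choices"][0]["text"])
```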

I have 2 GPUs (each 8 GB) and I want to use 13B. Please guide me on how I can use 2 GPUs.

@saifhassan have you figured out a way?

Does ExLlamaV2 allow concurrent request processing?

@abpani1994 Yes, but for more than about 10 users you should probably use a batch inference library like vLLM, Aphrodite, TGI, or SGLang. Otherwise ExLlamaV2 should be fine for a few users.
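
For illustration, a minimal vLLM sketch that batches several prompts in one call; the model name, GPTQ quantization flag, and sampling settings here are just examples, so check the vLLM docs for what your version supports:

```python
# Rough sketch of batched inference with vLLM, one of the libraries
# mentioned above for serving many concurrent users.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-13B-chat-GPTQ", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize GPTQ in one sentence.",
    "What does ExLlama do?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```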
