What GPU is needed for this 70B one?


Is an RTX A6000 48GB enough for the 70B?

It's enough for me:

Wed Jul 19 22:03:09 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0  On |                  Off |
| 30%   44C    P8              32W / 300W |    805MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:02:00.0 Off |                  Off |
| 44%   76C    P2             298W / 300W |  34485MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1262      G   /usr/lib/xorg/Xorg                          110MiB |
|    0   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                          430MiB |
|    0   N/A  N/A      2009      G   /usr/bin/gnome-shell                         86MiB |
|    0   N/A  N/A      4149      G   ...8417883,14948046860862319246,262144      151MiB |
|    1   N/A  N/A      1262      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A     44687      C   python                                    34460MiB |
+---------------------------------------------------------------------------------------+

@alfredplpl can you please share how you started it? Token lengths? Branch? I have the same setup but can't get it loaded...

Yeah, 4-bit uses around 36-38GB of VRAM to load, plus context, so 48GB should be plenty.
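A rough back-of-envelope for where that figure comes from (assuming ~70B parameters at 4 bits per weight, group-wise quantisation scales on top, and an fp16 KV cache with Llama-2-70B's 80 layers, 8 KV heads and head dim 128):

```python
# Back-of-envelope only; real usage also depends on group size, act-order and loader overhead.
n_params = 70e9
weights_gb = n_params * 4 / 8 / 1e9        # ~35 GB of packed 4-bit weights
# group-wise scales/zero-points add a few GB, giving roughly the 36-38 GB observed above

kv_bytes_per_token = 2 * 80 * 8 * 128 * 2  # K+V, 80 layers, 8 KV heads, head dim 128, fp16
kv_gb_at_4k = kv_bytes_per_token * 4096 / 1e9
print(f"{weights_gb:.0f} GB weights, {kv_gb_at_4k:.1f} GB KV cache at 4k context")
# -> 35 GB weights, 1.3 GB KV cache
```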

@harpergrieve check the README again; I recently updated it to describe the various steps that are needed, e.g. updating Transformers and, if you use text-generation-webui or AutoGPTQ from Python code, making sure inject_fused_attention=False is set.
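For reference, a minimal sketch of that AutoGPTQ path from Python (the repo id below is an assumption; the branch names are the ones mentioned later in this thread, and Transformers must already be upgraded in the same environment):

```python
# Sketch only -- assumes AutoGPTQ is installed and Transformers has been upgraded
# (pip install --upgrade transformers) inside the environment that runs this code.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-70B-chat-GPTQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    revision="main",               # or e.g. "gptq-4bit-32g-actorder_True"
    use_safetensors=True,
    device="cuda:0",
    inject_fused_attention=False,  # needed for 70B: the fused kernel doesn't support its grouped-query attention
    quantize_config=None,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```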

@TheBloke thanks for the reply. I'm using text gen inference and now getting the model.layers.0.self_attn.q_proj.weight error. I'll try using one of the other branches.

Did you update Transformers? And is that with Loader: AutoGPTQ?

Also try downloading the model again (same branch, i.e. main), just to double-check the download is OK.

Earlier today I confirmed text-generation-webui works OK with AutoGPTQ + the main file, using "no inject fused attention" and with Transformers updated to the latest version - be aware the update has to be done inside the Python environment of text-generation-webui, else it won't see the changes.

Yep, just updated Transformers and it got me past the OOM error. Now getting that self_attn.q_proj.weight error on both main and gptq-4bit-32g-actorder_True. Can the inject_fused_attention=False flag be set through an env var like bits and groupsize?

Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.

@TheBloke Thanks for the help, and thanks for the models! I appreciate your work. I'll try and look into it and report back any findings if I do get it working...

I guess not even the gptq-3bit--1g-actorder_True will fit into a 24 GB GPU (e.g. RTX 3090)?

Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.

FYI TGI should now work with this model, a PR was merged the other day

I guess not even the gptq-3bit--1g-actorder_True will fit into a 24 GB GPU (e.g. RTX 3090)?

Yeah, I don't think it will. You will need 2 x 24GB GPUs, or 1 x 48GB GPU. Or an asymmetric setup like 1 x 24GB + 1 x 12GB.

But 1 x 24GB won't fit it I'm afraid. Even the smallest file is 26GB.
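If it helps, here is a rough sketch of that kind of asymmetric split with AutoGPTQ (untested assumption; the repo id and per-GPU caps are illustrative, e.g. a 24GB card as GPU 0 and a 12GB card as GPU 1, with headroom left for activations and KV cache):

```python
from auto_gptq import AutoGPTQForCausalLM

# Illustrative split across two unequal GPUs; tune the caps for your own cards.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-chat-GPTQ",  # assumed repo id
    use_safetensors=True,
    inject_fused_attention=False,
    device_map="auto",
    max_memory={0: "21GiB", 1: "10GiB", "cpu": "32GiB"},
    quantize_config=None,
)
```

Whether it actually fits then depends on context length; layers are placed greedily against those caps, and anything that spills to the "cpu" entry will be very slow.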

Try the llama.cpp binary (latest commit). I was able to load the GGMLv3 with 24GB VRAM and 40GB of additional RAM. Got 0.83 tokens/second on a 4090 and i9-9900K with the non-chat version. Oobabooga is not updated/merged yet.
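For anyone who prefers Python over the binary, the same idea via the llama-cpp-python bindings (an assumption, not what the poster used; it needs an older release that still reads GGMLv3 files, since newer versions only load GGUF, and the model filename is hypothetical):

```python
# Partial GPU offload of a GGML 70B model; the llama.cpp binary equivalent is roughly:
#   ./main -m llama-2-70b.ggmlv3.q4_K_M.bin -ngl 45 -c 2048 -p "..."
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.ggmlv3.q4_K_M.bin",  # hypothetical local filename
    n_gpu_layers=45,  # offload as many of the 80 layers as fit in 24GB VRAM; the rest run on CPU
    n_ctx=2048,
)
out = llm("Q: What GPU do I need for a 70B model?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```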

It's slow (0.8-0.9 tokens/s), but with ExLlama_HF I got it working on a 24GB 4090.

@Squeezitgirdle How did you do that? AFAIK Exllama does not support offloading to CPU RAM. Or is that supported using the HF variant?

How much slower are 2 separate GPUs compared to one large-VRAM GPU?

It depends on the GPU model, electrical PCIe slots and CPU, I think. If you have two full PCIe 16x slots (not available on consumer motherboards) with two RTX 3080s, it will depend only on drivers and whether the model's loader supports multi-GPU. Some versions of AutoGPTQ may be slow, or even no better than with one GPU.

I've found that for private hobby use, a 60-70B model isn't worth playing with, because the difference compared to a good 13B or 30B model is not that big. Sometimes you are only missing the small percentage of cases where a model doesn't answer in your language. In that case, you can train it yourself simply by training on some books. Llama-2 7B may work for you with 12GB VRAM. You will need 20-30 GPU hours and a minimum of 50MB of high-quality raw text files (no page numbers or other garbage). Today I did my first working LoRA merge, which lets me train in short runs on 1MB text blocks. Training a 13B Llama 2 model with only a few MB of German text seems to work better than I hoped.

If you insist on inferencing with a 70B model, try pure llama.cpp. It is faster at small prompt sizes, so as discussed above you may reach 0.8 tokens per second. Prompting with 4K of history, you may have to wait minutes to get a response, at around 0.02 tokens per second. And we are talking about a 4090 GPU. With full multi-GPU support and running under Linux, this should get much faster with two of these GPUs.
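For anyone wanting to try the same kind of small LoRA run on their own text, here is a minimal QLoRA-style sketch (an assumption about the general approach, not the poster's actual setup; the dataset filename and hyperparameters are placeholders, and the base repo is gated):

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-hf"  # gated repo, access required
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

# Load the 7B base in 4-bit so training fits in roughly 12GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Your cleaned raw text (no page numbers or other garbage); one big file is fine.
data = load_dataset("text", data_files="my_clean_books.txt")["train"]
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-4, fp16=True, logging_steps=10),
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()

# To merge the adapter afterwards, reload the base in fp16, attach the adapter,
# and call merge_and_unload() on the PeftModel.
```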

About Llama-2-70B-chat fp16: if I have 8 * A10 (24GB), can I run it? Thanks!

The size of Llama 2 70B fp16 is around 130GB, so no, you can't run Llama 2 70B fp16 with 2 x 24GB. You need 2 x 80GB GPUs, 4 x 48GB GPUs, or 6 x 24GB GPUs to run fp16.

But you can run Llama 2 70B 4-bit GPTQ on 2 x 24GB and many people are doing this.

Hello, how much system RAM is needed for Llama 2 70B fp16?

I think you only need as much RAM as the size of one shard, which is only about 10GB. 64GB would be fine for example. Generally you won't find machines that have less RAM than VRAM anyway.

My GPUs are 16 * A10 (16 * 24GB). I have asked many people to solve this problem, but without success.
URL: https://github.com/h2oai/h2ogpt/issues/692
Command: CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" python generate.py --base_model=/data/model/llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True
A bug appears when I use more than 10 GPUs:
https://user-images.githubusercontent.com/74184102/262883754-9f065f93-4e54-4708-8584-6b80ccf438ab.png

10 GPUs work OK, but more GPUs would be helpful!
When I use <= 10 GPUs, it works, with this command: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python generate.py --base_model=/data/model/llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True
But I need more GPUs because longer prompts need more GPU memory. Thanks!

The size of Llama 2 70B fp16 is around 130GB, so no, you can't run Llama 2 70B fp16 with 2 x 24GB. You need 2 x 80GB GPUs, 4 x 48GB GPUs, or 6 x 24GB GPUs to run fp16.

But you can run Llama 2 70B 4-bit GPTQ on 2 x 24GB and many people are doing this.

So a Mac Studio with an M2 Ultra and 192GB would run Llama 2 70B fp16?

Say you have a beefy setup with 4 x L40 GPUs or similar: do these need to be connected with NVLink to get good performance, or is it enough for them to just reside in the same physical box for Llama 70B?

I am running on a Windows server with a Xeon processor and 4 Tesla GPUs with 64GB each. Only one user is able to interact with it at a time. The following error appears when another user asks a question or sends a prompt while the first one is still processing. Please advise.

Error Encountered
Error occurred during text generation: {"detail":{"msg":"Server is busy; please try again later.","type":"service_unavailable"}}

Sorry, I misread what you said earlier. Text Generation Inference doesn't work and I don't know of a fix at this time.

FYI TGI should now work with this model, a PR was merged the other day

It's October and it still does not work. The error about self_attn.q_proj.weight still appears while loading 70B chat GPTQ on Text Generation Inference. @TheBloke is there anything I am missing? I am using the latest TGI Docker version and the required CUDA configs as well.

Hi, I have 2 GPUs, only one of which is an Nvidia. I want to run Llama 2 7B-chat using only the Nvidia GPU (Linux Debian system).
I normally run Llama 2 with these commands (from this guide: https://lachieslifestyle.com/2023/07/29/how-to-install-llama-2/#preparing-to-install-l-la-ma-2):
#conda activate TextGen2
#cd text-generation-webui
#python server.py
Could you suggest how to do this?
Thanks :)


@Squeezitgirdle How did you do that? AFAIK Exllama does not support offloading to CPU RAM. Or is that supported using the HF variant?

Sorry, I'm just now responding.

I have absolutely no idea. I did it once using LM Studio, but that's it. I haven't been able to do it again after updating LM Studio.
