You're blazing fast!

#1
by Thireus - opened

I was just reading about Guanaco via https://github.com/artidoro/qlora & https://www.reddit.com/r/LocalLLaMA/comments/13r7pzg/gptqlora_efficient_finetuning_of_quantized_llms/. And you're already dropping the quantized models. Really impressive dedication! Keep up the good work.

TopBloke!

File name refers to a 65B-based model, but the model card info says 33B.

Thanks guys!

File name refers to a 65B-based model, but the model card info says 33B.

Thanks, fixed

The model card says it's possible to run this with under 24GB of VRAM. Is that correct?

Sorry that was a copy and paste error. It's not possible to load a 65B model in under 24GB VRAM - you can with a 30B/33B model.

To fully load a 65B model into VRAM you need 48GB of VRAM: either 1 x 48GB GPU (e.g. L40 or A6000), or 2 x 24GB GPUs (e.g. 4090 or 3090). Or you can use CPU offloading, but that's a lot slower.

Or you could try a GGML model with CUDA acceleration. Then you can load as many layers as possible onto the GPU - probably around 50-60 - and performance will be much faster than GPTQ 65B on a 24GB card (but probably not as fast as a 65B model on 2 x 24GB GPUs, or a 33B GPTQ model on 1 x 24GB).
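As a rough illustration, the layer offload looks something like this with llama-cpp-python, assuming a cuBLAS-enabled build; the model filename is just a placeholder for whichever guanaco-65B GGML quant you actually download, and the layer count will need tuning to your VRAM:

from llama_cpp import Llama

# Placeholder filename - use whichever GGML quant file you downloaded
llm = Llama(
    model_path="guanaco-65B.ggmlv3.q4_0.bin",
    n_gpu_layers=60,   # offload as many of the 65B model's 80 layers as fit on the GPU
    n_ctx=2048,        # full context
)

output = llm("### Human: Who was the first US president?\n### Assistant:", max_tokens=256)
print(output["choices"][0]["text"])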

@TheBloke Do you know if GGML supports 2 GPUs? I have 2x4090, and while 65B loads, my max context is about 1300 (or 1700 with multigen) with group size 128, or 1500-1800 with g128. With some layers on the CPU via pre_layer I'm able to do it at 2048 context, but it's a lot slower. GGML should be faster in that case.

@BGLuck Not yet, no. But judging by the rate they're adding new features, I wouldn't be at all surprised if it was added soon.

Regarding 2 x 4090 with 65B - are you using CUDA or Triton? If CUDA, try Triton. I found it can lower VRAM requirements. Also, you have to get the layers balanced and that can take some fine tuning. You need to put fewer layers on GPU 1 because that also runs the context.

A few days ago I was able to get a full 2000 context out of a 65B model with 2 x 4090. I can't remember the exact details but I can try again soon.

But to achieve that it needs to be a no-groupsize model. All my 65Bs have group_size = none (-1); group size 128 will use a lot more VRAM.
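For anyone who'd rather do that split outside the webui, here's a rough sketch through AutoGPTQ's Python API, assuming an auto-gptq version that accepts max_memory; treat the per-GPU figures as starting points to tune, not a recipe:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/guanaco-65B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename="Guanaco-65B-GPTQ-4bit.act-order",
    use_safetensors=True,
    use_triton=True,                      # Triton can lower VRAM usage (Linux only)
    # Give the first GPU less, since it also holds the context / KV cache
    max_memory={0: "16GiB", 1: "21GiB"},  # assumed parameter - adjust per card
    quantize_config=None,
)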

@TheBloke I'm on Windows, so sadly I'm only on the CUDA branch. I can test on another drive with Linux.

I do about a 16/21 split across the two GPUs. Gonna keep testing though.

@Thireus bro, how did you load this model?

I followed this:

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the quantized model fully onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

prompt = "who is the first president of US"
prompt_template = f'''### Instruction: {prompt}

Response:'''

# Standard transformers text-generation pipeline over the quantized model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

print(pipe(prompt_template)[0]['generated_text'])

It's giving answers very slowly - I mean the inference time.
Can you suggest a way to make it faster?
