Long waiting time

#9
opened by wempoo

Hi, I've used the example that you provided to run TheBloke/Llama-2-70B-GPTQ, and it looks like it works, but it takes a long time to get any result. I changed the prompt text to "Hello" and tested the script by running python app.py.

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False, # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        max_memory={ 0: "24GIB", 1: "24GIB", 2: "24GIB", 3: "24GIB" },
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

"""
To download from a specific branch, use the revision parameter, as in this example:

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        inject_fused_attention=False, # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        quantize_config=None)
"""

prompt = "Hello"
prompt_template=f'''{prompt}
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

However, I've been waiting a very long time for any results. After approximately 2 minutes, I received a response similar to this:

Hello

I have a problem with my code. I have a table with 2 columns. I want to display the data from the table in a list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view with 2 text views. I want to display the data from the table in the list view. I have a list view
Pipeline:
Hello
  }
}
\end{code}
Comment: I'm not sure if this is the best way to do it, but you can use a `static` method in your class. This will allow you to call that method without creating an instance of the object.

I've also tested text-generation-webui on this server, configured with the same model, and it runs much faster in comparison. What am I doing wrong?

What hardware are you testing on? And please show the full output you received from AutoGPTQ, including all log messages it produced. If it's very slow, it could be because your AutoGPTQ extension is not properly compiled, which would be indicated by the message "CUDA extension not installed".

I've tested it in the same Docker container and on the same hardware (https://instances.vantage.sh/aws/ec2/g5.24xlarge) that I run text-generation-webui on. I'm running the text-generation-webui server with this command (I'm including it to show that I'm using the GPU-memory option that lets me spread the load across multiple graphics cards):
python3 server.py --loader autogptq --model TheBloke_Llama-2-70B-GPTQ --no_inject_fused_attention --listen --gpu-memory 20 20 20 20

So, to do the same in your Python script, I've added the max_memory parameter directly to the AutoGPTQForCausalLM.from_quantized call. I've also restarted the Docker container to make sure text-generation-webui wasn't using any video memory, since I had stopped the server. Then I changed the prompt for test purposes, as shown in the screenshots below:
(screenshots attached: image.png, image.png)

It takes a long time. Do you have any ideas on how to address this? And how can I improve the prompt to obtain better answers to the question provided? In text-generation-webui, the same example works much better.

I can't see any obvious problem. Your CUDA extension is installed OK.

Loading a model across 4 GPUs is always going to be slow, much slower than one. It's also unnecessary in this case: you only need two 24GB GPUs for this model, not four. So you should only load it on two GPUs, which might be a bit quicker than four. (Adding more GPUs doesn't speed it up.)
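For example, restricting the load to two GPUs would be a small change to your script; a minimal sketch, reusing your existing arguments and just shrinking the max_memory map:

# Same call as in the original script, but only offering GPUs 0 and 1 to the loader
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False, # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        max_memory={ 0: "24GiB", 1: "24GiB" }, # two GPUs instead of four
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)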

I wonder if the difference in perceived performance is simply because text-generation-webui streams the answers to you, and with AutoGPTQ you have to wait until the entire answer is generated. And/or maybe AutoGPTQ is generating longer answers.
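If you want streaming from the Python script as well, so tokens print as they are generated rather than all at once at the end, transformers' TextStreamer should work with the model you already loaded; a sketch, assuming a reasonably recent transformers version:

from transformers import TextStreamer

# Prints each new token to stdout as it is generated, instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True)
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512, streamer=streamer)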

For much better performance, I strongly suggest using ExLlama instead, in text-generation-webui. Install ExLlama (or use my 'one-click' Docker container which has text-gen + AutoGPTQ + ExLlama already installed: https://github.com/TheBlokeAI/dockerLLM) and then use --loader exllama in text-generation-webui. You would then want to configure a split between GPUs with --gpu-split 17.2,24 - meaning 17.2GB on GPU 1, 24GB on GPU 2, and nothing on the others. We put less on GPU 1 because it also has to hold the context.
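For example, adapting the server command you posted earlier (the model directory name is taken from your command; --gpu-split takes comma-separated GB values per GPU):

python3 server.py --loader exllama --model TheBloke_Llama-2-70B-GPTQ --gpu-split 17.2,24 --listen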

That will give you 12 - 15 tokens/s in text-generation-webui with this model.

If you want to access it from Python code, you could then use the text-generation-webui API: text-generation-webui acts as the loader, and your Python code talks to it over HTTP. There are example Python scripts that use the API here: https://github.com/oobabooga/text-generation-webui/tree/main/api-examples.
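A minimal sketch of calling that API with requests, assuming the server was started with the --api flag and is listening on the default port 5000 (the exact endpoint and payload depend on your text-generation-webui version, so treat the api-examples scripts linked above as the reference):

import requests

# Assumed local endpoint; adjust host/port to match your server
URI = "http://localhost:5000/api/v1/generate"

request = {
    "prompt": "Hello",
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "repetition_penalty": 1.15,
}

response = requests.post(URI, json=request)
if response.status_code == 200:
    print(response.json()["results"][0]["text"])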

As to why your output is bad - remember that this is a base model, not a fine-tuned one. It isn't expected to be great at answering questions or following instructions. I believe you can get somewhat better results with suitable prompt engineering, but I don't know exactly what prompts to use for that. You might want to try a fine-tuned model instead, like FreeWilly 2 or Airoboros L2 70B GPT4 1.4. I have done GPTQs for both of those, and they can be run identically to this model.
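As an illustration of what prompt engineering for a base model can look like (this is a generic completion-style framing, an assumption rather than a known-good template for this model), you can phrase the prompt as text the model naturally continues:

# Illustrative completion-style prompt for a base (non-instruct) model;
# the wording here is an assumption, not an official prompt template.
question = "What is the capital of France?"
prompt_template = f'''Question: {question}
Answer:'''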

Thank you for your detailed response! I really appreciate it! I'll let you know how it went :)!

I've tested the method you suggested and it works, thank you! The only issue I've encountered is that when I send a request to localhost:5000 or use the UI, it seems that all the requests are being queued. The first request takes around 7 seconds to resolve, and the second request sent immediately after the first takes around 14 seconds to resolve. Is there a way to change the queue limit or increase the maximum number of prompts processed?

I don't think text-generation-webui supports batching yet - i.e. it can't generally run more than one query at a time. For that you will need something more sophisticated, like vLLM or Hugging Face's Text Generation Inference. vLLM works only with unquantised models at the moment; you can't use GPTQ. So you'd need huge hardware for 70B.

Text Generation Inference supports GPTQ and it recently got fixed to work with Llama 2 70B, so you could try that.
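A rough sketch of querying a Text Generation Inference server from Python, assuming you have launched TGI with this model and GPTQ quantisation enabled and that it is listening on localhost port 8080 (the endpoint and parameters below are assumptions; check the TGI documentation for your version):

import requests

# Assumed local TGI endpoint; adjust to match your deployment
URI = "http://localhost:8080/generate"

payload = {
    "inputs": "Hello",
    "parameters": {"max_new_tokens": 512, "temperature": 0.7},
}

response = requests.post(URI, json=payload)
print(response.json()["generated_text"])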

You would then want to configure a split between GPUs with --gpu-split 17.2,24 - meaning 17.2GB on GPU 1, 24GB on GPU 2, and nothing on the others. We put less on GPU 1 because it also has to hold the context.

What about 4 GPUs? I have 4 3090s which I intend to host for collab, do I go with 17.2,24,24,24?

I went with 7,10.5,10.5,10.5.
It seems to work; the workload is distributed among the 4 GPUs.

(screenshot attached: gpusplit2.png)
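For reference, a launch command with that split would look something like this (assuming the same model directory name used earlier in the thread):

python3 server.py --loader exllama --model TheBloke_Llama-2-70B-GPTQ --gpu-split 7,10.5,10.5,10.5 --listen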

@TheBloke I've used Text Generation Inference and it works really well! I've also tried two different setups:

  1. A single GPU with 40 GiB of video memory, running the Llama 13B model.
  2. 4 GPUs with 4 x 24 = 96 GiB of video memory, splitting the workload across them.

The results are better in the second setup (generation is faster). I tested it using the same prompt and configuration, and it seems that splitting the model across multiple graphics cards has no negative effect.

Hello, I notice you mention that a "7,10.5,10.5,10.5" GB memory split can run a 70B GPTQ model. That's 38.5 GB in total. Does that mean that if I have an A100 40GB, llama-70b-chat-GPTQ could run on a single 40GB A100 card? Thank you in advance.

@HelloJiang there is also context to take into account. That is why he put only 7 on GPU 0 and 10.5 on the others, allowing 3GB for context.

But yes, I think 40GB is enough for the 4bit-128g models, at least using ExLlama. A 128g model is around 36.5 GB in size, so in 40GB you have ~3.5GB for context, which should be enough. So yes, you should try it on one 40GB GPU; I think it will work.

It is taking more than 30 minutes for Llama 2 to generate text from a retrieval chain. Can someone let me know why that is, and what I can do to make this process much faster?

I have the same problem: "TheBloke/Llama-2-70B-GPTQ" works like a charm inside text-generation-webui, but is very slow in a Python session. Still no solutions to this?
