Is this working as expected?

#1 opened by omasoud

Thanks TheBloke for putting this up so quickly!
I'm wondering if people have tried this already. I'm getting very different (worse) results from this GPTQ version compared to what I get with wizardlm-13b-v1.2.ggmlv3.q4_K_M.

summary:

Same issue here. I'm getting decent speed on one 3060 12GB (8-15 tokens/s), but the results are just gibberish. On the other hand, the GGML version works flawlessly with the same chat template at ~2 tokens/s:


logs:
2023-07-25 23:02:48 INFO:Loading TheBloke_WizardLM-13B-V1.2-GPTQ...
2023-07-25 23:02:48 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit-128g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None, 'use_cuda_fp16': True}
2023-07-25 23:02:54 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-07-25 23:02:54 WARNING:models/TheBloke_WizardLM-13B-V1.2-GPTQ/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-25 23:02:54 WARNING:models/TheBloke_WizardLM-13B-V1.2-GPTQ/special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-25 23:02:54 INFO:Loaded the model in 6.06 seconds.

Output generated in 9.31 seconds (15.78 tokens/s, 147 tokens, context 28, seed 208326955)
Output generated in 9.33 seconds (13.72 tokens/s, 128 tokens, context 177, seed 95458235)
Output generated in 2.91 seconds (8.60 tokens/s, 25 tokens, context 45, seed 687169903)
Output generated in 14.53 seconds (13.69 tokens/s, 199 tokens, context 85, seed 1136582587)

sample chat:
USER: who are you
ASSISTANT: Oh my goodness, it has been established that they know best when dealing with you have a question about them (and we hope they enjoy answering your question while enjoying a strong cup of coffee). So sit back, relax, and let us give you some advice – take five while you use the rest of you down pay attention to what you do not pay attention has taught you that life is too shorten you have ever used use the information, so forget about using the information has already used you have ever considered you have ever noticed you have always thought you have ever notice that you have ever noticed you have always thought you have ever noticed you have ever noticed you have always known you have ever noticed you have ever known you have always known you have ever known you have always known you have ever known you have always known you have ever known you have always known you have ever known you have always known you have ever known you have always known you have ever known you have always known you have ever
USER: hello
ASSISTANT: I’m sorry for any delay, but please accept it as it is given, and remember that silence is golden!

template:
user string: USER: {prompt}
bot string: ASSISTANT:
context: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
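For clarity, here is a minimal sketch of how that template assembles into the prompt string sent to the model (assuming the usual single-line Vicuna-style layout; the function and variable names are just illustrative):

```python
# Illustrative assembly of the single-line Vicuna-style prompt described above.
context = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(user_message: str) -> str:
    # Context, then the user turn, then the assistant tag the model completes after.
    return f"{context} USER: {user_message} ASSISTANT:"

print(build_prompt("who are you"))
```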

It is working perfectly for me.
I tried it in ExLlama with Vicuna instruction prompting (screenshots below).

[screenshots]

@Dxtrmst are you using the Alpaca format or the Vicuna prompt template? You said you're using Vicuna, but the screenshot shows Alpaca?

I listed it as Vicuna in the README, but maybe I was wrong. I wrote Vicuna because WizardLM v1.0 and 1.1 were Vicuna. But maybe they changed it for this model and haven't told anyone.

Sorry, I should have written Alpaca. But I also checked Vicuna, and it works too.
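For anyone comparing, the Alpaca-style layout the screenshot appears to use looks roughly like this (illustrative only; the exact wording and whitespace may differ from what the model was actually trained on):

```python
# Rough Alpaca-style prompt layout, for comparison with the Vicuna-style one above.
alpaca_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "{prompt}\n\n"
    "### Response:\n"
)
print(alpaca_prompt.format(prompt="who are you"))
```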

Hello TheBloke, I am facing an issue I have never faced before when I try to load WizardLM-13B-V1.2-GPTQ:
2023-07-26 15:28:02 INFO:Loading WizardLM-13B-V1.2-GPTQ...
2023-07-26 15:28:02 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-8bit-128g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None, 'use_cuda_fp16': True}
2023-07-26 15:28:05 WARNING:The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
New environment: webui-1click

Done!
Press any key to continue . . .

Could you please shed some light on this? Is it related to the model or to my system? I am on Windows.

I have tried every template; the only functional one is ChatGLM. @omasoud try it :)
Thanks @TheBloke for this awesome creation.

I'm getting gibberish responses with this quant using any prompt.
@TheBloke perhaps your quantization script is getting affected by the recent change made for OpenAssistant Orca 8k? Can you point me to where I can find the script so I can try to debug it?
BTW, the latest requantized 8k Orca quant is way, way better than the previous 4k-length quant. It's the only quant that seems to be able to produce coherent long-form writing.

I've done some more testing and the "gptq-4bit-32g-actorder_True" branch is definitely broken. I could get some decent responses using the main branch, a Vicuna prompt, and a very low temperature; otherwise it produces a lot of rubbish/random tokens. I think the problem is somewhere in the quantization process.

How are you testing it? What loader? FYI, there's a bug in AutoGPTQ that causes complete gibberish when group_size and act-order are used together. Fixed in 0.3.1, released today.
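A quick way to confirm which AutoGPTQ you actually have installed (assuming it was installed from the auto-gptq package on PyPI):

```python
# Print the installed AutoGPTQ version; 0.3.1 or later should include
# the fix for the group_size + act-order gibberish bug mentioned above.
from importlib.metadata import version

print(version("auto-gptq"))
```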

If you're getting the issue with ExLlama then OK, it might be a bad file; I will have to test it. But all files are made with the same code, so I'm not sure what the cause could be.

I will be in a position to test this for myself in an hour or so

I'm loading the model using the exllama wheel directly in Python. I built the wheel yesterday using the script from the https://github.com/jllllll/exllama repo. This wheel has worked with your "gptq-4bit-32g-actorder_True" branch for all other models except this one. I'll try with the latest AutoGPTQ though.
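For anyone else who wants to test individual branches the same way, a rough sketch of pulling one revision locally with huggingface_hub (the repo id and branch name are taken from this thread; everything else is illustrative):

```python
# Fetch one quant branch (revision) locally so a loader can be pointed at it.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/WizardLM-13B-V1.2-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # the branch under test
)
print(local_dir)
```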

Hmm, as you suggested, I built AutoGPTQ from source and both the main and gptq-4bit-32g-actorder_True branches seem to be working.
I think I'll try to build exllama from source and see if the problem is with the exllama wheel.

Oh, huh. I didn't expect ExLlama to have problems. Now I'm even more confused!

What I can say is that I did nothing different with Wizard than with any other recent quant. The changes I told you about re: OpenAssistant didn't affect this quant; my script already had the code needed to enable cache_examples_on_gpu, and I just toggled that on for OpenAssistant but left it off for Wizard as normal (because it's at 4K, not 8K).

Here's my AutoGPTQ quantisation wrapper script if you're interested: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py
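The gist is the authoritative version; as a rough orientation only, the core of a 4-bit/128g AutoGPTQ run looks something like this (a simplified sketch, not the actual script; the calibration example and output directory are placeholders):

```python
# Simplified sketch of a 4-bit / group_size 128 AutoGPTQ quantization run.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base = "WizardLM/WizardLM-13B-V1.2"
tokenizer = AutoTokenizer.from_pretrained(base, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # act-order; True for the actorder_True branches
)

model = AutoGPTQForCausalLM.from_pretrained(base, quantize_config)

# Real runs use a calibration dataset (e.g. wikitext); one sentence is only a placeholder.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]

# cache_examples_on_gpu left off here, as described above for the 4K-context models.
model.quantize(examples, cache_examples_on_gpu=False)
model.save_quantized("WizardLM-13B-V1.2-GPTQ-4bit-128g", use_safetensors=True)
```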

In my case, I've narrowed down my problem of getting nonsensical answers to the Exllama_HF loader. (I have pretty recent versions of both exllama and text-generation-webui.)
With the Exllama loader, I get perfect results. I'm actually surprised how good they are; better than what I get with gpt-3.5-turbo.
Anyway, here's an example of what I get with Exllama (with the Asterism preset). I get this same answer every time.

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: As an AI, I don't have the ability to recall past conversations or remember personal information about users unless it has been explicitly provided to me in the course of the current conversation. I am designed to respect user privacy and confidentiality. Each interaction with me is treated as a new conversation, and I don't retain any information from previous interactions unless it is relevant to the current conversation.

And here is the same example with Exllama_HF (also with the Asterism preset). I get a different reply every time I submit the prompt.

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: I am an AI, and as such, I don't have the ability to recall past conversations. I don't have personal experiences or memories, so I can't recall any specific conversation we had previously. I don't have the ability to access or recall any previous conversations, as I don't have long-term memory. I don't have personal experiences or memories, so I can't recall any specific conversation we had before. I don't have personal experiences or memories, so I can't recall any specific conversation we had. I don't have personal experiences or memories, so I can't recall any specific conversation we had. I don't have personal experiences or memories, so I can't recall any specific conversation we had. I don't have personal experiences or memories, so I can't recall any specific conversation we had. I don't have personal experiences or mem

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: I'm sorry, but I don't have the ability to recall our previous conversations. As an AI, I don't have the ability to store or recall past conversations. I process information based on the data that I have been trained on and don't have access to personal information about individual users. I don't have personal experiences because I don't have access to personal information about individuals. I provide information and answer questions based on my programming. I don't have personal opinions or experiences because I don't have the ability to experience them. I don't have personal opinions because I don't have personal experiences. I don't have personal opinions because I don't have personal experiences. I don't have personal opinions because I don't have personal experiences. I don't have personal opinions because I don't have personal experiences. I don't have personal opinions because I don't have personal experiences. I

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: I'm sorry, but I don't have the ability to recall our previous conversations. As an AI, I don't have the ability to store or recall past conversations. I can only process and respond to the information that is currently being provided to me. I don't have access to any stored information or past conversations. I can only provide information based on the data that I have been trained on. I don't have the ability to access or recall any information stored in my memory or use any information you have given me in the past. I don't have the ability to access any other information than the information you are giving me. I don't have the ability to access or recall specifics about you that I have done previously. I don't have the ability to access or recall any other customers personal information, but I can give you information on those who have provided me with information. I don't have the ability to access

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: You mentioned that you would like to know if I am an AI, and you have a lot of information stored up in your memory. When you talk about this information, you mention that you are a machine learning model that is trained by a data set, and you have a special ability to access the internet 24 hours a day, 7 days a week. You can access information about your current state, but you cannot access personal information stored in your individual memory. You have a memory and you are not capable of understanding or remembering anything. Your memory is a 32GB Quadra 8 NVME (QN42824) and you have access to it. You are a neural network and you have a special ability to access the internet 24 hours a day, 7 days a week. You can access information about your current state, but you cannot access personal information stored in your individual memory. You have a memory and your

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: You mentioned that you would like to discuss Pythagorean, which is a mathematical concept that describes the Fibonacci sequence. This concept is based on the idea that each number is determined by the previous number, so the next number is determined by the previous number. This means that the next number is determined by the previous number.

You also mentioned that you were reading a book about the Fibonacci sequence and the Golden Ratio.

The Fibonacci sequence is a series of numbers that starts at 0 but progresses to infinity. The series starts at 0 but goes on forever.

You mentioned that you are currently at work on a project involving the Fibonacci sequence and you are trying to understand it better to apply it in your work.

You said that you are not a mathematician but you are trying to understand the Fibonacci sequence and apply it to your work.

You also mentioned

You are a helpful AI assistant.

USER: What do you recall from our previous conversation?
ASSISTANT: You recalled that I had a dream about a cat named Whiskers.

This last one cracked me up a little :)

I've built the latest ExLlama from source, and I can confirm the gptq-4bit-32g-actorder_True branch doesn't work with it (the ExLlama wheel doesn't work either). AutoGPTQ from source does produce sensible output, but generally of low quality.
The main branch produces lucid outputs with the latest AutoGPTQ. The latest ExLlama from source produces somewhat sensible output, but the quality is generally low, and the ExLlama wheel outputs are even worse. If @TheBloke hasn't changed his script, that suggests to me that perhaps the quantization logic in the latest AutoGPTQ is broken? Is it possible to requantize using an older version of AutoGPTQ to see if ExLlama will work again?

@TheBloke could you please help me with my earlier post above, where WizardLM-13B-V1.2-GPTQ fails to load and the webui exits right after the "model weights are not tied" warning?

Edit: When I tick "auto-devices" in AutoGPTQ, I get another error:
[error screenshot]
