Did anyone get it to run?

#1
by dimaischenko - opened

Did anyone get it to run? My setup:

CUDA 11.7, RTX 3090 24 GB

torch==2.1.1+cu118
transformers==4.36.0
auto-gptq==0.6.0.dev0+cu118 (built from source: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral)

Trying to load:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
                "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
                model_basename="model",
                revision="gptq-3bit-128g-actorder_True",
                strict=False,  # Tried with and without this parameter. The result is the same
                use_triton=False,
                use_safetensors=True,
                trust_remote_code=False,
                device="cuda:0",
                disable_exllama=True,
                disable_exllamav2=True,
                quantize_config=None)

I get this error:

File "/root/venv/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 276, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: QuantLinear() does not have a parameter or a buffer named weight.

I tried the same with CUDA 12.1, torch==2.1.1+cu121, and auto-gptq==0.6.0.dev0+cu121 built from source. Same error.

Unfortunately there was an issue with the branch I linked; I didn't realise that the author had made another commit to it which broke inference again. I've now updated the README to reference a different branch.

The newly linked PR will now work: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix

Built AutoGPTQ OK with CUDA 12.1, transformers 4.36.0, and torch==2.1.1+cu121 (auto-gptq==0.6.0.dev0+cu121), but model loading failed in text-generation-webui:

Traceback (most recent call last):
  File "/home/me/text-generation-webui/modules/ui_model_menu.py", line 208, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/me/text-generation-webui/modules/models.py", line 89, in load_model
    output = load_func_map[loader](model_name)
  File "/home/me/text-generation-webui/modules/models.py", line 380, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "/home/me/text-generation-webui/modules/AutoGPTQ_loader.py", line 58, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 102, in from_quantized
    model_type = check_and_get_model_type(model_name_or_path, trust_remote_code)
  File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/_utils.py", line 232, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: mixtral isn't supported yet.

I probably missed something to end up with "mixtral isn't supported yet".

But what?

@tsalvoch most likely you did not build auto-gptq from the Mixtral-fix git branch. I had the same error when I built it from the master branch.

https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix

git checkout Mixtral-fix
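
To double-check which build actually got installed, here is a quick sketch, assuming the auto-gptq 0.6.x layout where the supported-architectures list lives in auto_gptq.modeling._const:

import auto_gptq
# SUPPORTED_MODELS is the list that check_and_get_model_type() consults before
# raising "... isn't supported yet." (see the traceback above).
from auto_gptq.modeling._const import SUPPORTED_MODELS

print("auto-gptq version:", auto_gptq.__version__)
print("mixtral supported:", "mixtral" in SUPPORTED_MODELS)
# False here usually means the installed wheel was built from master rather than
# from the Mixtral-fix branch.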


@TheBloke Thank you!

@dimaischenko How did you get this to run on a 3090? With Mixtral-fix it does try to load, but it runs out of memory on my 4090.
I do have 2x 4090; I guess I'll look through the code base to see if/how to specify multiple GPUs.

@bdambrosio I am OK on a 3090, even with revision="main", but you can try revision="gptq-3bit-128g-actorder_True"; it takes about 19 GB (see the example in my first post in this thread).

Ah, yup, just realized my error. I had loaded a larger version, assuming I would use both GPUs. Downloading the smaller version now, while also trying to figure out the syntax of the AutoGPTQ .from_pretrained device parameter.
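
For reference, a minimal multi-GPU sketch, assuming the accelerate-style device_map / max_memory arguments of auto-gptq 0.6.x (quantized checkpoints load via from_quantized rather than from_pretrained; the 20GiB caps are placeholders, not tuned numbers):

from auto_gptq import AutoGPTQForCausalLM

# Sketch: shard the quantized weights across two GPUs via accelerate's device_map.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    revision="gptq-4bit-128g-actorder_True",
    model_basename="model",
    use_safetensors=True,
    device_map="auto",                    # let accelerate place layers on cuda:0 / cuda:1
    max_memory={0: "20GiB", 1: "20GiB"},  # per-GPU limits; leave headroom for activations
)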

tnx!

Ah, in case anyone else stumbles on this: @TheBloke, any ideas?

gptq-4bit-128g-actorder_True:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # or the local path of the gptq-4bit-128g-actorder_True download

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename="model",
    use_safetensors=True,
    per_gpu_max_memory={0: "20GIB", 1: "20GIB"},
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template = f'''[INST] {prompt} [/INST]'''
print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.1, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

(mistral) bruce@bruce-AI:~/Downloads/alphawave/tests/Sam$ python mixtral-8x-GPTQ.py
MixtralGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
MixtralGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.

*** Generate:
Traceback (most recent call last):
  File "/home/bruce/Downloads/alphawave/tests/Sam/mixtral-8x-GPTQ.py", line 31, in <module>
    output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
  File "/home/bruce/miniconda3/envs/mistral/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 447, in generate
    return self.model.generate(**kwargs)
  File "/home/bruce/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2897, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
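
Not a fix, but a small debugging sketch: the exception comes from torch.multinomial during sampling, so checking whether the logits are already NaN/inf tells you whether the problem is in the quantized forward pass itself rather than in the sampling settings. It reuses model, tokenizer and input_ids from the script above; .model is the underlying transformers model the AutoGPTQ wrapper delegates to (as in the traceback).

import torch

# If the logits already contain NaN/inf, the torch.multinomial failure is just a
# downstream symptom of the forward pass (e.g. a broken kernel or branch),
# not of the sampling parameters.
with torch.no_grad():
    logits = model.model(input_ids).logits   # .model = underlying transformers model
print("any nan:", torch.isnan(logits).any().item())
print("any inf:", torch.isinf(logits).any().item())

# Greedy decoding skips torch.multinomial entirely, separating
# "bad logits" from "bad sampling settings".
output = model.generate(inputs=input_ids, do_sample=False, max_new_tokens=32)
print(tokenizer.decode(output[0]))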
