A weird bug

#22
by XceptDev - opened

Hello @TheBloke, sorry for bothering you, but I am trying to run this falcon-40b-instruct-GPTQ model in my oobabooga web UI. When I try to load it, I get the error "RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes."

My settings:
(screenshot of my model loader settings)

Entire log:
2023-08-09 22:03:42 INFO:Loading TheBloke_falcon-40b-instruct-GPTQ...
2023-08-09 22:03:42 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': None, 'quantize_config': None, 'use_cuda_fp16': True}
2023-08-09 22:03:42 WARNING:Exllama kernel is not installed, reset disable_exllama to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
2023-08-09 22:03:42 ERROR:Failed to load the model.
Traceback (most recent call last):
File "E:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 179, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "E:\oobabooga_windows\text-generation-webui\modules\models.py", line 78, in load_model
output = load_func_map[loader](model_name)
File "E:\oobabooga_windows\text-generation-webui\modules\models.py", line 292, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
File "E:\oobabooga_windows\text-generation-webui\modules\AutoGPTQ_loader.py", line 56, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\auto.py", line 108, in from_quantized
return quant_func(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 817, in from_quantized
model = AutoModelForCausalLM.from_config(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 428, in from_config
return model_class._from_config(config, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1146, in _from_config
model = cls(config, **kwargs)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 693, in init
self.transformer = RWModel(config)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 509, in init
self.h = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 509, in
self.h = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 372, in init
self.mlp = MLP(config)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 352, in init
self.dense_4h_to_h = Linear(4 * hidden_size, hidden_size, bias=config.bias)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in init
self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes.

My system specs:
RAM: 32GB DDR4
GPU: Gigabyte GeForce RTX 3060 12GB
CPU: 11th Gen Intel(R) Core(TM) i9-11900F @ 2.50GHz
MB: Gigabyte Z590 UD AC

Regards @TheBloke !

This means you don't have enough RAM on this system. You should be able to work around that by making sure you have a large Pagefile available - I'd recommend at least 100GB Pagefile.
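For a rough sanity check before loading, something like the sketch below (it needs psutil installed, and the threshold is only illustrative, not a measured requirement) shows how much RAM plus pagefile the process can actually draw on. The traceback above fails while PyTorch builds the un-quantised model skeleton on the CPU, so you want a lot of headroom there.

import psutil

# Free physical RAM plus free pagefile/swap is roughly what PyTorch's CPU
# allocator can hand out before "DefaultCPUAllocator: not enough memory".
vm = psutil.virtual_memory()
sw = psutil.swap_memory()

headroom_gb = (vm.available + (sw.total - sw.used)) / 1024**3
print(f"Available RAM:      {vm.available / 1024**3:.1f} GB")
print(f"Free pagefile/swap: {(sw.total - sw.used) / 1024**3:.1f} GB")
print(f"Total headroom:     {headroom_gb:.1f} GB")

# Illustrative threshold only - with a 100GB pagefile you should be well clear of it.
if headroom_gb < 90:
    print("Likely to hit the DefaultCPUAllocator error - enlarge the Windows pagefile.")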

Hello @TheBloke, now it successfully loads, but when I try to chat with the AI, I get another weird error. I'll send you a donation if we can resolve this.

Error:
Traceback (most recent call last):
File "E:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 55, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "E:\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 307, in generate_with_callback
shared.model.generate(**kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 443, in generate
return self.model.generate(**kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1633, in generate
return self.sample(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2755, in sample
outputs = self(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 759, in forward
transformer_outputs = self.transformer(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 654, in forward
outputs = block(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 396, in forward
attn_outputs = self.self_attention(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Krzysztof/.cache\huggingface\modules\transformers_modules\TheBloke_falcon-40b-instruct-GPTQ\modelling_RW.py", line 252, in forward
fused_qkv = self.query_key_value(hidden_states) # [batch_size, seq_length, 3 x hidden_size]
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\nn_modules\qlinear\qlinear_cuda_old.py", line 271, in forward
out = out + self.bias if self.bias is not None else out
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Exception in thread Thread-7 (gentask):
Traceback (most recent call last):
File "E:\oobabooga_windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "E:\oobabooga_windows\installer_files\env\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "E:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 62, in gentask
clear_torch_cache()
File "E:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 94, in clear_torch_cache
torch.cuda.empty_cache()
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\cuda\memory.py", line 133, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Ah yeah, it looks like you don't have enough VRAM so it is trying to split it across CPU and GPU, which is not supported with this model:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
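
If you want to confirm that's what happened, a quick check like the one below (just a sketch; model here means the underlying PyTorch module - the AutoGPTQ wrapper keeps it as model.model, if I remember correctly) shows which devices the weights ended up on. Anything other than cuda:0 means layers were offloaded to CPU:

from collections import Counter

# Count how many parameters ended up on each device after loading.
# For this model everything must be on cuda:0; any 'cpu' entries are exactly
# what triggers the "Expected all tensors to be on the same device" error.
print(Counter(str(p.device) for p in model.parameters()))

# If accelerate dispatched the model, the layer-to-device map may also be stored here:
print(getattr(model, "hf_device_map", None))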

There's no solution for that I believe. You just need more VRAM, or to use another model. Even if you do get this working, the Falcon 40B GPTQs are super slow.

Or you could try my Falcon 40B GGMLs. They won't work in text-generation-webui though.

Personally I no longer recommend Falcon 40B to people. It is slow, has several problems, and has never been properly integrated into Transformers which suggests to me it's not being actively maintained any more.

With the release of Llama 2, it's now possible to get a fully commercially licensed model with much better support and performance, and decent quality. There's no Llama 2 model of the same size as Falcon 40B yet (there's meant to be a Llama 2 34B released at some point, but we haven't heard any more news on that recently), but the 13B is very capable and the 70B is much better than Falcon 40B - if you have a big enough system to run it.

Or if you don't care about commercial licensing, check out one of the many Llama 1 30B/33B models which are very good and also very well supported.

If you're determined to run Falcon 40B in text-generation-webui, the option I would recommend is to download the original unquantised Falcon 40B model (linked in my README) and use Transformers + bitsandbytes load_in_4bit. For that you will need 1 x 48GB GPU or 2 x 24GB GPU. You can load that in text-generation-webui using the Transformers loader; there's a checkbox for load in 4 bit.
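Outside the webui, the equivalent in plain Transformers looks roughly like this (a sketch, untested on your exact setup; the model ID is the standard tiiuae repo - use whatever my README links to):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"  # the unquantised original

# 4-bit loading via bitsandbytes; device_map="auto" spreads the layers across
# however many GPUs are available (e.g. 2 x 24GB).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)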

Hello @TheBloke! Thank you for the great suggestions. For that 1 x 48GB GPU or 2 x 24GB GPU, can I just add VRAM to my GPU through regedit?

...no. You need real VRAM on a real GPU.

Hi @TheBloke. I managed to run your Llama 13B model. A weird question: can you make instructions for newbies (like me) on training/fine-tuning it using text-generation-webui?

@XceptDev How did you get it to run?

I couldn't run the model on my 4090 (24GB VRAM).
I was in dependency-nightmare mode between pip and NVIDIA CUDA, so I went back to the oobabooga repo and followed the instructions to set it up in Docker.
Snags for the Docker setup:

  • need to move files from the docker folder to the text-generation-webui folder
  • may need to use localhost:7860 instead of 0.0.0.0:7860, depending on your OS and browser

After setting up all the prereqs for Docker, it's nice that dependency management is all on the container, which is also a nice extra level of safety. Good luck to everyone out there. I will try again on my local machine after I rebuild my computer.

Guys, I will tell you soon, when I'm back home.

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_name_or_path = "TheBloke/falcon-7b-instruct-GPTQ"
model_basename = "model"
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the GPTQ-quantised model onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    disable_exllamav2=False,
    torch_dtype=torch.float16,
    quantize_config=None,
)

prompt = "Tell me about AI"
prompt_template = f'''A helpful assistant who helps the user with any questions asked.
User: {prompt}
Assistant:'''

print("\n\n*** Generate:")

logging.set_verbosity(logging.CRITICAL)

# Run inference through the standard transformers text-generation pipeline.
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

print(pipe(prompt_template)[0]['generated_text'])
I am using this code for inference. I have tried it both with "torch_dtype=torch.float16" and without it, but I get the same error every time.

File /media/adnan/66f2dbd6-5cbe-49fb-9bc5-7622a4fbc0f5/Text_Work/updated_python/penv/lib/python3.11/site-packages/transformers/generation/utils.py:1673, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1656 return self.assisted_decoding(
...
31 output = torch.empty((x.shape[0], q4_width), dtype = torch.half, device = x.device)
---> 32 gemm_half_q_half(x, q_handle, output, force_cuda)
33 return output.view(output_shape)

RuntimeError: a is incorrect datatype, must be kHalf

@AdnanBajwa I have 2 x NVIDIA Quadro RTX 6000 (24GB each), so 48 GB of GPU memory in total, and 128 GB of RAM.
