About these warnings...

#7
by mancub - opened

WARNING:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit-64g', 'device': 'cuda:0', 'use_triton': False, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': None}
WARNING:The safetensors archive passed at models/thebloke_falcon-7b-instruct-gptq/gptq_model-4bit-64g.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
WARNING:can't get model's sequence length from model config, will set to 4096.
WARNING:RWGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
WARNING:RWGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.

Do I need to specify more params when starting text_generation_webui, like group size, and should it be using Triton or not (right now it says: False)?

I did compile the CUDA extension from AutoGPTQ and it seems to be working OK on my WSL2/3090 setup:

Output generated in 42.66 seconds (3.09 tokens/s, 132 tokens, context 65, seed 201216109)

Of course it's nothing to write home about at 3 t/s, but it's a start.

Those warnings are completely normal and expected. But I agree they're not ideal. They cause confusion.

You don't need to specify group_size because that comes from quantize_config.json. Specifying group_size etc. would only be needed for models that don't have a quantize_config.json. But all my recent ones do, and I'll add it to all old models soon.
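To illustrate what that file carries, here's roughly what the fields look like when you load them; the values below are just inferred from the "4bit-64g" naming, not copied from the actual file in this repo:

```python
import json

# Illustrative only: field values inferred from the "4bit-64g" model name,
# not copied verbatim from the repo's quantize_config.json.
example_quantize_config = {
    "bits": 4,               # 4-bit quantisation
    "group_size": 64,        # the "64g" part of the filename
    "desc_act": False,       # act-order off
    "sym": True,
    "true_sequential": True,
}
print(json.dumps(example_quantize_config, indent=2))
```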

The first line is printed by text-gen-ui to show what the GPTQ params are. I don't know why ooba is printing it as a warning; it's purely informational.
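For anyone curious, those params end up being passed through to AutoGPTQ when the model is loaded. A rough sketch of the equivalent direct call, using the path and basename from the log above (an approximation of what the webui does, not its exact code path):

```python
from auto_gptq import AutoGPTQForCausalLM

# Roughly what text-generation-webui does with the params it logs;
# a sketch, not the webui's exact loading code.
model = AutoGPTQForCausalLM.from_quantized(
    "models/thebloke_falcon-7b-instruct-gptq",
    model_basename="gptq_model-4bit-64g",
    device="cuda:0",
    use_triton=False,         # CUDA kernels rather than Triton
    use_safetensors=True,
    trust_remote_code=True,   # Falcon/RW needs its custom modelling code
    max_memory=None,
)
```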

WARNING:RWGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
WARNING:RWGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.

These lines are because AutoGPTQ tries by default to turn on two performance-enhancing features, fused attention and fused MLP. But only Llama currently supports them, so it's telling you that it tried to enable those features automatically and skipped them because they aren't available for this model type. Again, ideally these warnings shouldn't be printed unless the user actually requested those features.
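If you load through AutoGPTQ directly you can opt out of those two features explicitly, which also avoids the warnings; whether the webui exposes these options is another matter. Building on the from_quantized() sketch above:

```python
from auto_gptq import AutoGPTQForCausalLM

# Opt out of the fusion features up front so AutoGPTQ doesn't attempt
# (and warn about) them on architectures like Falcon/RW that lack support.
model = AutoGPTQForCausalLM.from_quantized(
    "models/thebloke_falcon-7b-instruct-gptq",
    model_basename="gptq_model-4bit-64g",
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
    inject_fused_attention=False,  # currently Llama-only
    inject_fused_mlp=False,        # currently Llama-only
)
```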

WARNING:The safetensors archive passed at models/thebloke_falcon-7b-instruct-gptq/gptq_model-4bit-64g.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.

This comes automatically from the safetensors library. Ideally AutoGPTQ should suppress this message I guess.
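If anyone wants to make it go away without waiting for a fix: the warning is only about the file lacking a metadata header, and re-saving the same tensors with a metadata dict (which is what save_pretrained does) silences it. A minimal illustration, with a made-up output filename:

```python
from safetensors.torch import load_file, save_file

# Illustrative only: re-save the same tensors with a metadata header so the
# safetensors library stops warning and defaulting to 'pt' metadata.
src = "models/thebloke_falcon-7b-instruct-gptq/gptq_model-4bit-64g.safetensors"
dst = "gptq_model-4bit-64g-with-metadata.safetensors"  # hypothetical output name

tensors = load_file(src)
save_file(tensors, dst, metadata={"format": "pt"})
```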

WARNING:can't get model's sequence length from model config, will set to 4096.

This is unique to Falcon. It might be a bug in AutoGPTQ's Falcon support code; it should probably default Falcon to 2048, as that's the correct max sequence length. But it won't affect text-gen, which will limit output to 2048 anyway.
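For the curious, the 4096 fallback happens because AutoGPTQ looks for a sequence-length field in the model config, and Falcon's custom RW config doesn't expose one under the usual names. A rough way to see that for yourself (the attribute list here is illustrative, not AutoGPTQ's exact lookup):

```python
from transformers import AutoConfig

# Falcon's custom RW config doesn't define the usual sequence-length fields,
# which is why AutoGPTQ falls back to 4096. Attribute names are illustrative.
config = AutoConfig.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
for key in ("max_position_embeddings", "n_positions", "seq_length"):
    print(key, getattr(config, key, None))
```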

Thanks again, learning new stuff all the time :)

mancub changed discussion status to closed
