Yi-34B-Chat-4bits fails to deploy in AWS SageMaker

#36
by angeligareta - opened

I am trying to deploy the new quantized versions to SageMaker to experiment with them. When I specify this configuration:

import json

config = {
    'HF_MODEL_ID': '01-ai/Yi-34B-Chat-4bits',
    'SM_NUM_GPUS': json.dumps(4),
    'QUANTIZE': 'awq',
}
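(For reference, this config is passed as the environment of a SageMaker HuggingFaceModel, roughly as in the sketch below; the execution role, container version, and instance type there are placeholder assumptions, not values from my actual setup.)

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Assumption: running in a SageMaker notebook/Studio where an execution role is available
role = sagemaker.get_execution_role()

# TGI ("LLM") container; without an explicit version the SDK picks the latest one it knows about
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(role=role, image_uri=image_uri, env=config)  # `config` is the dict above

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # assumption: a 4-GPU instance to match SM_NUM_GPUS=4
    container_startup_health_check_timeout=600,
)
print(predictor.predict({"inputs": "Hi, who are you?"}))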

I get the following error:

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 201, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 68, in init
model = FlashLlamaForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 478, in init
self.model = FlashLlamaModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 416, in init
[
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 417, in
FlashLlamaLayer(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 353, in init
self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 233, in init
self.query_key_value = load_attention(config, prefix, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 154, in load_attention
return _load_gqa(config, prefix, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 195, in _load_gqa
get_linear(weight, bias=None, quantize=config.quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 332, in get_linear
linear = WQLinear(
NameError: name 'WQLinear' is not defined

Is something else needed in the config, or is something missing in the Hugging Face image?

Thank you in advance!

P.S. The GPTQ version (01-ai/Yi-34B-Chat-8bits) worked out of the box!

You need to update transformers to the latest version. We tested the AWQ version with transformers==4.35.2, and it worked fine.
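As a quick sanity check outside SageMaker, the AWQ checkpoint can also be loaded directly with transformers; below is a minimal sketch, assuming the autoawq and accelerate packages are installed and there is enough GPU memory (the generation settings are illustrative, not tested values from this thread).

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-34B-Chat-4bits"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The AWQ quantization config is read from the checkpoint itself;
# loading it requires the autoawq package and a GPU with enough memory.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "hi"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))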

Oh, it seems the problem is actually with the TGI image. You can follow the progress of the issue at https://github.com/huggingface/text-generation-inference/issues/1234
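Once a TGI release containing that fix ships in the SageMaker LLM containers, the container version can be pinned explicitly when building the image URI; a short sketch (the version string below is a placeholder assumption, not a confirmed fixed release):

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Placeholder version string: use whichever TGI release actually includes the fix
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.1")
print(image_uri)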

Thank you, I will keep an eye on it.

angeligareta changed discussion status to closed
