Deployment via SageMaker

Has anyone been able to deploy this model successfully via AWS SageMaker?

AWQ support for TGI (Text Generation Inference) was added in this PR: https://github.com/huggingface/text-generation-inference/pull/1019
But I'm still unable to deploy it.

Getting this error: `RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist`

Any help/leads are appreciated, thanks!

That error indicates TGI doesn't know to load it as an AWQ model, and is instead trying to load it as a full fp16 model.

TGI requires passing the `--quantize awq` parameter. I'm afraid I have no idea how one is meant to do that on SageMaker. Check the docs to see if they support TGI parameters, and then see if you can pass `--quantize awq` or `quantize=awq` in whatever way you're meant to pass params on SageMaker.
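If SageMaker exposes TGI's launcher flags as container environment variables, as the Hugging Face LLM containers generally do, the mapping might look something like this sketch (using `HF_MODEL_QUANTIZE` to stand in for `--quantize` is my assumption here):

```python
# Sketch, assuming the container maps HF_MODEL_QUANTIZE to TGI's --quantize flag
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-70B-Chat-AWQ',
    'HF_MODEL_QUANTIZE': 'awq',  # equivalent to passing --quantize awq
}
```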

I was successfully able to deploy the GPTQ model: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GPTQ

For that, we passed the quantization parameter like this:

```python
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-70B-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(8),
    'HF_MODEL_QUANTIZE': 'gptq'
}
```
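For context, here's a minimal sketch of roughly how we wired a dict like that into a SageMaker deployment with the prebuilt HF LLM container (the exact image version and instance type below are assumptions, not necessarily what we used):

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Environment variables consumed by the TGI container
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-70B-Chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(8),
    'HF_MODEL_QUANTIZE': 'gptq'
}

# Resolve the prebuilt Hugging Face TGI image; the version is an assumption
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(image_uri=image_uri, env=hub, role=role)

# Instance type is an assumption: pick one with at least SM_NUM_GPUS GPUs
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
)
```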

When I tried setting it to 'awq', I got the following error:

```
NameError: name 'WQLinear' is not defined
```

I see. My tentative guess is that the version of TGI on Sagemaker hasn't yet been updated for AWQ support, but I don't know for certain.

Does it show the TGI version number used on SageMaker?

I think this is the version used: [number].dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04

Hmm, that version does have AWQ support.

In that case I'm afraid I have no idea.

I'll do a test with TGI myself shortly.

Thanks a lot for your response!

OK, I tested TGI 1.1.0 and it works fine with this model in AWQ. So yes, it must be a SageMaker-specific error. Perhaps you can contact their support, or raise it on the TGI GitHub Issues page?

Logs:

```
2023-11-10T09:05:39.522435546Z 2023-11-10T09:05:39.522312Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Llama-2-70B-Chat-AWQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), dtype: None, trust_remote_code: false, max_concurrent_requests: 15, max_best_of: 1, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1748, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "8964d39b570b", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-11-10T09:05:39.522544778Z 2023-11-10T09:05:39.522483Z  INFO download: text_generation_launcher: Starting download process.
2023-11-10T09:05:42.315068236Z 2023-11-10T09:05:42.314833Z  INFO text_generation_launcher: Download file: model-00001-of-00004.safetensors
2023-11-10T09:05:42.315114540Z 
2023-11-10T09:07:15.902265462Z 2023-11-10T09:07:15.902062Z  INFO text_generation_launcher: Downloaded /data/models--TheBloke--Llama-2-70B-Chat-AWQ/snapshots/55c2786a75adef2b89bf3157d1517536d817c936/model-00001-of-00004.safetensors in 0:01:33.
2023-11-10T09:07:15.902328389Z 
2023-11-10T09:07:15.902333139Z 2023-11-10T09:07:15.902147Z  INFO text_generation_launcher: Download: [1/4] -- ETA: 0:04:39
2023-11-10T09:07:15.902337888Z 
2023-11-10T09:07:15.906472082Z 2023-11-10T09:07:15.906342Z  INFO text_generation_launcher: Download file: model-00002-of-00004.safetensors
2023-11-10T09:07:15.906511263Z 
2023-11-10T09:08:47.940922762Z 2023-11-10T09:08:47.940757Z  INFO text_generation_launcher: Downloaded /data/models--TheBloke--Llama-2-70B-Chat-AWQ/snapshots/55c2786a75adef2b89bf3157d1517536d817c936/model-00002-of-00004.safetensors in 0:01:32.
2023-11-10T09:08:47.940967880Z 
2023-11-10T09:08:47.940973816Z 2023-11-10T09:08:47.940843Z  INFO text_generation_launcher: Download: [2/4] -- ETA: 0:03:05
2023-11-10T09:08:47.940978566Z 
2023-11-10T09:08:47.945311040Z 2023-11-10T09:08:47.945205Z  INFO text_generation_launcher: Download file: model-00003-of-00004.safetensors
2023-11-10T09:08:47.945330037Z 
2023-11-10T09:10:20.270476788Z 2023-11-10T09:10:20.270307Z  INFO text_generation_launcher: Downloaded /data/models--TheBloke--Llama-2-70B-Chat-AWQ/snapshots/55c2786a75adef2b89bf3157d1517536d817c936/model-00003-of-00004.safetensors in 0:01:32.
2023-11-10T09:10:20.270514782Z 
2023-11-10T09:10:20.270519531Z 2023-11-10T09:10:20.270415Z  INFO text_generation_launcher: Download: [3/4] -- ETA: 0:01:32.333333
2023-11-10T09:10:20.270524280Z 
2023-11-10T09:10:20.274980234Z 2023-11-10T09:10:20.274843Z  INFO text_generation_launcher: Download file: model-00004-of-00004.safetensors
2023-11-10T09:10:20.275017040Z 
2023-11-10T09:11:25.412663457Z 2023-11-10T09:11:25.412473Z  INFO text_generation_launcher: Downloaded /data/models--TheBloke--Llama-2-70B-Chat-AWQ/snapshots/55c2786a75adef2b89bf3157d1517536d817c936/model-00004-of-00004.safetensors in 0:01:05.
2023-11-10T09:11:25.412701450Z 
2023-11-10T09:11:25.412707387Z 2023-11-10T09:11:25.412512Z  INFO text_generation_launcher: Download: [4/4] -- ETA: 0
2023-11-10T09:11:25.412714511Z 
2023-11-10T09:11:25.834379766Z 2023-11-10T09:11:25.834141Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-11-10T09:11:25.834701526Z 2023-11-10T09:11:25.834580Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-11-10T09:11:35.849216024Z 2023-11-10T09:11:35.848950Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-11-10T09:11:45.862731998Z 2023-11-10T09:11:45.862549Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-11-10T09:11:55.876080563Z 2023-11-10T09:11:55.875857Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-11-10T09:12:05.889716455Z 2023-11-10T09:12:05.889505Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-11-10T09:12:15.903991117Z 2023-11-10T09:12:15.903779Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-11-10T09:12:22.123551559Z 2023-11-10T09:12:22.123316Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-11-10T09:12:22.123595489Z 
2023-11-10T09:12:22.211694670Z 2023-11-10T09:12:22.211527Z  INFO shard-manager: text_generation_launcher: Shard ready in 56.375529197s rank=0
2023-11-10T09:12:22.242591895Z 2023-11-10T09:12:22.242378Z  INFO text_generation_launcher: Starting Webserver
2023-11-10T09:12:22.750219183Z 2023-11-10T09:12:22.749887Z  WARN text_generation_router: router/src/main.rs:349: `--revision` is not set
2023-11-10T09:12:22.750259552Z 2023-11-10T09:12:22.749944Z  WARN text_generation_router: router/src/main.rs:350: We strongly advise to set it to a known supported commit.
2023-11-10T09:12:23.139812970Z 2023-11-10T09:12:23.139531Z  INFO text_generation_router: router/src/main.rs:371: Serving revision 55c2786a75adef2b89bf3157d1517536d817c936 of model TheBloke/Llama-2-70B-Chat-AWQ
2023-11-10T09:12:23.146714771Z 2023-11-10T09:12:23.146544Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-11-10T09:12:34.683371443Z 2023-11-10T09:12:34.683085Z  INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 124880
2023-11-10T09:12:34.683411812Z 2023-11-10T09:12:34.683117Z  INFO text_generation_router: router/src/main.rs:247: Connected
```

Sure, I'll raise it there. Thanks for checking!

One clarification: what does `quantize: Some(Awq)` mean?
What value exactly did you set here? I can probably try using the same.

That's just how it prints the parameters, which is a bit weird: each one is tagged either None or Some(value) (it's Rust's Option type).

I set `--quantize awq`.

Update: the error was indeed from the AWS side.
The workaround to get it deployed was to use an HF TGI image directly instead of the SageMaker one.

@abhatia2 could you please elaborate on how you used the image directly? I am interested in the same use case.

You can build a Docker image of TGI version 1.1.0 locally and push it to ECR.
Then that image can be used when deploying the SageMaker endpoint.

Something like this:

```python
huggingface_model = HuggingFaceModel(
    image_uri=custom_image_uri,
    env=hub,
    role=role,
)
```
Also make sure you're setting this parameter: `'HF_MODEL_QUANTIZE': 'awq'`
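Putting the workaround together, a minimal sketch (the ECR image URI, GPU count, and instance type are placeholders/assumptions):

```python
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Placeholder: your TGI 1.1.0 image, built locally and pushed to ECR
custom_image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/tgi:1.1.0"

hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-70B-Chat-AWQ',
    'SM_NUM_GPUS': json.dumps(4),  # assumption: match your instance's GPU count
    'HF_MODEL_QUANTIZE': 'awq',    # tells TGI to load the AWQ weights
}

huggingface_model = HuggingFaceModel(
    image_uri=custom_image_uri,
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # assumption: choose per your workload
)
```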

Thank you very much. I will definitely do that :)
