How do I deploy an 8-bit quantized model to SageMaker?

#20
by beksin - opened

Here is my model config for SageMaker:

import json

number_of_gpu = 4  # g5.12xlarge has 4 GPUs
hf_token = "<your Hugging Face access token>"  # needs access to the gated Llama 3 repo

config = {
    'HF_MODEL_ID': "meta-llama/Meta-Llama-3-70B-Instruct",
    'SM_NUM_GPUS': json.dumps(number_of_gpu),       # tensor parallel degree
    'MAX_INPUT_LENGTH': json.dumps(2048),           # max prompt tokens
    'MAX_TOTAL_TOKENS': json.dumps(4096),           # prompt + generated tokens
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),     # token budget per batch
    'HUGGING_FACE_HUB_TOKEN': hf_token,
    'HF_MODEL_QUANTIZE': "bitsandbytes",            # 8-bit quantization in TGI
}
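For completeness, the deployment call looks roughly like this (a sketch based on the guide below; the TGI image version, timeout, and endpoint settings are my own choices, and the role comes from my notebook environment):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # SageMaker execution role

# HF LLM inference container (TGI); version is my assumption, pick one that supports Llama 3
llm_image = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,  # the config dict from above
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=900,  # give the 70B weights time to load
)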

I referred to this guide: https://www.philschmid.de/sagemaker-llama-llm#3-hardware-requirements

However, I am trying to deploy the quantized model with 'HF_MODEL_QUANTIZE': "bitsandbytes" on a g5.12xlarge instance (4x A10G, 96 GB total VRAM), and I figured this should be enough for an 8-bit 70B model; see the rough memory math below.
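My back-of-envelope reasoning (an estimate, not a measured number): at 8-bit, weights take roughly one byte per parameter, so:

# Rough memory estimate for 8-bit Llama 3 70B (back-of-envelope, not measured)
params = 70e9                    # ~70B parameters
weights_gib = params * 1 / 1024**3   # ~1 byte per parameter at 8-bit
print(f"weights: ~{weights_gib:.0f} GiB")              # ~65 GiB

total_vram_gib = 4 * 24          # g5.12xlarge: 4x A10G, 24 GiB each
headroom_gib = total_vram_gib - weights_gib
print(f"headroom for KV cache etc.: ~{headroom_gib:.0f} GiB")  # ~31 GiB

So the weights alone should fit, with ~31 GiB left over for the KV cache, activations, and CUDA overhead, which is why I expected this instance to work.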

I am getting a bunch of errors in the CloudWatch logs, but they aren't very clear. Am I using the right machine for this? Is my config wrong? (The snippet below shows how I'm pulling the endpoint logs.)
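In case it helps, this is roughly how I fetch the logs (the endpoint name is a placeholder; SageMaker endpoints write to the /aws/sagemaker/Endpoints/<endpoint-name> log group):

import boto3

logs = boto3.client("logs")
endpoint_name = "my-llama3-endpoint"  # placeholder
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

# Print the most recent events across all container log streams
events = logs.filter_log_events(logGroupName=log_group, limit=50)
for event in events["events"]:
    print(event["message"])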

beksin changed discussion status to closed
