Amazon Sagemaker deployment

#11
by jdmiwx - opened

Deploying a SageMaker endpoint with an ml.g5.2xlarge instance, as demonstrated in the provided code sample, is not feasible due to a CUDA out of memory error. It appears that the minimum required configuration for the endpoint is an ml.g5.48xlarge instance, which comes with 8 GPUs.
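For context, the deployment looks roughly like the sketch below (this follows the usual Hugging Face LLM/TGI container pattern; the model ID, container version, and env values are assumptions on my part, not the exact provided sample):

```python
# Sketch of the SageMaker deployment, assuming the Hugging Face TGI (LLM) container.
# Model ID, container version, and env values are assumptions, not the exact sample code.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

llm_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.3.3"),
    env={
        "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model ID
        "SM_NUM_GPUS": "8",  # ml.g5.48xlarge exposes 8 GPUs
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# instance_type="ml.g5.2xlarge" (1x A10, 24 GB) fails with CUDA OOM;
# ml.g5.48xlarge (8x A10, 192 GB total) is the smallest g5 size that worked.
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Hello, Mixtral!"}))
```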

jdmiwx changed discussion title from Amazon Sagemaker deploymnet to Amazon Sagemaker deployment
NousResearch org

That shouldn't be the case, otherwise SageMaker is bad

@teknium Why? ml.g5.2xlarge is a 1xA10 (24GB) instance; the unquantized model shouldn't fit.

For what it's worth, trying to deploy Mixtral 8x7B through vLLM on 4xA10 also CUDA OOMs for me, so g5.24xlarge (4xA10) doesn't cut it either. It has to be g5.48xlarge.
The AWQ version runs fine on 4xA10 though.
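For reference, the AWQ path looks roughly like this (the AWQ repo name below is a placeholder, not necessarily the exact weights I used):

```python
# Sketch of serving an AWQ-quantized Mixtral with vLLM on 4x A10 (e.g. g5.24xlarge).
# The AWQ model repo is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # placeholder AWQ repo
    quantization="awq",
    tensor_parallel_size=4,  # shard across the 4 A10s
    dtype="float16",         # AWQ kernels typically run in fp16
)

outputs = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```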

NousResearch org

The provided example inference code sample that jdmiwx mentioned does actually quantize to 4-bit. But even in fp16, Mixtral should fit on his 8-GPU setup; with 8x 24 GB you have about 2x the VRAM needed to run it.
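To spell out the arithmetic: Mixtral 8x7B has roughly 47B parameters, so the fp16 weights alone are about 94 GB, while 8x 24 GB gives 192 GB, i.e. roughly 2x headroom. A generic 4-bit load along those lines would look something like this (a transformers/bitsandbytes sketch with a placeholder model ID, not the exact sample code):

```python
# Minimal 4-bit loading sketch, assuming transformers + bitsandbytes + accelerate.
# The model ID is a placeholder; substitute the repo actually being deployed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shards layers across all visible GPUs
)

inputs = tokenizer("Hello, Mixtral!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```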
