Errors when deploying to AWS SageMaker

#60
by djokowsj90 - opened

I got an error when deploying this model to AWS SageMaker:

"No safetensors weights found for model bigcode/starcoder at revision None. Converting PyTorch weights to safetensors."

It seems SageMaker expects a single weights file such as "model.pth" or "pytorch_model.bin",
but this repo has many sharded bin files like "pytorch_model-00003-of-00007.bin" etc.
I don't think I can simply concatenate those bin files.
Has anyone encountered this issue?
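(For context, the shards are not meant to be concatenated byte-wise. Hugging Face sharded checkpoints ship with a pytorch_model.bin.index.json whose "weight_map" tells the loader which shard holds each tensor. A minimal sketch with a made-up two-shard index, just to illustrate the mechanism:)

```python
import json

# Made-up miniature version of pytorch_model.bin.index.json:
# "weight_map" maps each tensor name to the shard file that holds it.
index_json = json.dumps({
    "metadata": {"total_size": 123456},
    "weight_map": {
        "transformer.wte.weight": "pytorch_model-00001-of-00002.bin",
        "lm_head.weight": "pytorch_model-00002-of-00002.bin",
    },
})

index = json.loads(index_json)
weight_map = index["weight_map"]

# A loader opens only the shards it needs, per tensor,
# rather than concatenating the files on disk.
shards_needed = sorted(set(weight_map.values()))
print(shards_needed)
print(weight_map["lm_head.weight"])
```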

I also faced this; I don't know how to solve it.

I got past this error.
SageMaker will actually do the safetensors conversion for you, but you need to give it more time.

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=1200,  # seconds
)

Set container_startup_health_check_timeout to a larger value and the deployment will get past this error.
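(A sketch of how I keep those deploy arguments together; huggingface_model is the sagemaker.huggingface.HuggingFaceModel you already created, and 1200 seconds is simply the value that was enough for the conversion on my instance, not an official recommendation:)

```python
# Deployment arguments collected in a dict so they can be inspected
# and reused; passed as huggingface_model.deploy(**deploy_kwargs)
# on a real sagemaker.huggingface.HuggingFaceModel object.
deploy_kwargs = {
    "initial_instance_count": 1,
    "instance_type": "ml.g5.8xlarge",
    # The default health-check timeout is too short for the on-the-fly
    # PyTorch -> safetensors conversion of a large checkpoint.
    "container_startup_health_check_timeout": 1200,
}

# Sanity check before deploying: leave real headroom for the conversion.
assert deploy_kwargs["container_startup_health_check_timeout"] >= 600
```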

But then I encountered the next error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 
21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I upgraded to a bigger instance type and experimented with PYTORCH_CUDA_ALLOC_CONF, but the error persisted.
Let me know if you see the same error.
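(For anyone retrying this: the allocator setting is passed as an environment variable to the container, e.g. via HuggingFaceModel(env=...). A sketch; max_split_size_mb:512 is just one value I tried, not a recommendation, and tuning it cannot create memory the GPU doesn't have:)

```python
# Env vars handed to the serving container.
# PYTORCH_CUDA_ALLOC_CONF tunes PyTorch's caching allocator;
# max_split_size_mb caps the size of blocks that may be split,
# which can reduce fragmentation.
env = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",
}

# PyTorch parses the value as comma-separated key:value pairs:
opts = dict(
    pair.split(":") for pair in env["PYTORCH_CUDA_ALLOC_CONF"].split(",")
)
print(opts)
```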

It worked after switching to instance type ml.g4dn.12xlarge and setting SM_NUM_GPUS: "4".
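(That working configuration, expressed as the env dict passed to the model. A sketch: SM_NUM_GPUS tells the container how many GPUs to shard the model across, and ml.g4dn.12xlarge has 4 GPUs; HF_MODEL_ID here is an assumed addition pointing at the repo this thread is about:)

```python
# Environment for HuggingFaceModel(env=...): shard the model across
# all four GPUs of the ml.g4dn.12xlarge instance.
env = {
    "HF_MODEL_ID": "bigcode/starcoder",  # assumed: the model this thread deploys
    "SM_NUM_GPUS": "4",  # note: must be a string, not an int
}

instance_type = "ml.g4dn.12xlarge"
gpus_per_instance = {"ml.g4dn.12xlarge": 4, "ml.g5.8xlarge": 1}

# Sanity check: don't request more shards than the instance has GPUs.
assert int(env["SM_NUM_GPUS"]) <= gpus_per_instance[instance_type]
```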

Yes, I got it working with these configs. Thank you so much!

djokowsj90 changed discussion status to closed