Unable to deploy the model to an Inference Endpoint due to an error, and also not to SageMaker via the script

Opened by dm-mschubert

Hi out there,
I was wondering whether anyone has managed to deploy the model to an Inference Endpoint via the deploy button, or to SageMaker via the script.
It looks like I'm running into the following error, according to my CloudWatch logs:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 22.20 GiB total capacity; 21.21 GiB already allocated; 31.12 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for any help!

That means the GPU memory is too small.
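As a rough back-of-the-envelope check (my own numbers, plugged in from the CloudWatch log above): the fp16 weights of a 13B-parameter model alone already need more memory than the single GPU in that log reports, before any KV cache or activations are counted.

```python
# Back-of-the-envelope memory check (assumptions: fp16/bf16 weights,
# ~13e9 parameters; GPU capacity taken from the CloudWatch log above).
params = 13e9          # Llama-2-13B parameter count (approximate)
bytes_per_param = 2    # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3

gpu_gib = 22.20        # "GPU 0; 22.20 GiB total capacity" from the log

print(f"weights alone: ~{weights_gib:.1f} GiB vs {gpu_gib} GiB on the GPU")
# weights alone: ~24.2 GiB vs 22.2 GiB on the GPU
```

So a single ~24 GB card cannot hold the model in fp16; you need more/larger GPUs or a quantized load.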

Hi @flozi00,
I keep struggling to get flozi00/Llama-2-13B-german-assistant-v2 deployed to AWS SageMaker, including an endpoint for Lambda access.
I used the instructions from the deploy dropdown here on Hugging Face.
I upgraded the instance type to "ml.g5.4xlarge", but it still seems to run out of memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 22.20 GiB total capacity; 21.21 GiB already allocated; 31.12 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
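For reference, the deployment I'm trying looks roughly like the sketch below, adapted from the deploy dropdown snippet. The container version, instance type, SM_NUM_GPUS and the quantization option are my own guesses at something that might fit the 13B weights, not a configuration I have confirmed to work:

```python
# Sketch of a SageMaker deployment using the Hugging Face LLM (TGI) container.
# Values marked as guesses are assumptions, not verified settings.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# TGI-based LLM container (version is a guess)
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

env = {
    "HF_MODEL_ID": "flozi00/Llama-2-13B-german-assistant-v2",
    "SM_NUM_GPUS": "4",              # guess: shard across the 4x A10G of a g5.12xlarge
    "MAX_INPUT_LENGTH": "1024",
    "MAX_TOTAL_TOKENS": "2048",
    # alternative for a single 24 GB GPU (guess): load quantized weights
    # "HF_MODEL_QUANTIZE": "bitsandbytes",
}

model = HuggingFaceModel(role=role, image_uri=image_uri, env=env)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # guess: multi-GPU instance instead of ml.g5.4xlarge
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Hallo, wie geht es dir?"}))
```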

I also tried to deploy via an Inference Endpoint, but I always end up with the following error:

[screenshot of the failed endpoint, 2023-07-24 01:16]
The logs look like this:

...File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 729, in warmup\n raise RuntimeError(\nRuntimeError: Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`\n"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.751618Z","level":"ERROR","message":"Server error: Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_input_length":1024,"max_prefill_tokens":4096,"max_total_tokens":16000,"name":"warmup"},{"name":"warmup"}]} 2023/07/24 01:05:17 ~ Error: Warmup(Generation("Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`")) 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.799769Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.799803Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.982467Z","level":"INFO","fields":{"message":"Shard 0 terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]} 2023/07/24 01:05:17 ~ Error: WebserverFailed 2023/07/24 01:06:00 ~...

Any kind of help, tutorial, video, or AWS SageMaker snippet would be appreciated. I'm also happy to DM if possible.
Thank you!

Since I don't have the time to debug the AWS services, please contact https://www.primeline-systemhaus.de/.
Primeline is the main sponsor of my research and runs its own datacenters.
