Unable to deploy the model to an Inference Endpoint due to an error, and also not to SageMaker via the script

Opened by dm-mschubert

Hi out there,
I was wondering whether anyone has managed to deploy the model to an Inference Endpoint via the deploy button, or to SageMaker via the script.
It looks like I'm running into the following error, according to my CloudWatch logs:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 22.20 GiB total capacity; 21.21 GiB already allocated; 31.12 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for any help!

That means the GPU memory is too small.
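As a rough back-of-the-envelope check (my own numbers, plugged in from the CloudWatch log above): the fp16 weights of a 13B-parameter model alone already need more memory than the single GPU in that log reports, before any KV cache or activations are counted.

```python
# Back-of-the-envelope memory check (assumptions: fp16/bf16 weights,
# ~13e9 parameters; GPU capacity taken from the CloudWatch log above).
params = 13e9          # Llama-2-13B parameter count (approximate)
bytes_per_param = 2    # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3

gpu_gib = 22.20        # "GPU 0; 22.20 GiB total capacity" from the log

print(f"weights alone: ~{weights_gib:.1f} GiB vs {gpu_gib} GiB on the GPU")
# weights alone: ~24.2 GiB vs 22.2 GiB on the GPU
```

So a single ~24 GB card cannot hold the model in fp16; you need more/larger GPUs or a quantized load.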

Hi @flozi00,
I keep struggling to get flozi00/Llama-2-13B-german-assistant-v2 deployed to AWS SageMaker, including an endpoint for Lambda access.
I used the instructions from the deploy dropdown here on Hugging Face.
I upgraded the instance type to "ml.g5.4xlarge", but it still seems to run out of memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 22.20 GiB total capacity; 21.21 GiB already allocated; 31.12 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
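For reference, the deployment I'm trying looks roughly like the sketch below, adapted from the deploy dropdown snippet. The container version, instance type, SM_NUM_GPUS and the quantization option are my own guesses at something that might fit the 13B weights, not a configuration I have confirmed to work:

```python
# Sketch of a SageMaker deployment using the Hugging Face LLM (TGI) container.
# Values marked as guesses are assumptions, not verified settings.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# TGI-based LLM container (version is a guess)
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

env = {
    "HF_MODEL_ID": "flozi00/Llama-2-13B-german-assistant-v2",
    "SM_NUM_GPUS": "4",              # guess: shard across the 4x A10G of a g5.12xlarge
    "MAX_INPUT_LENGTH": "1024",
    "MAX_TOTAL_TOKENS": "2048",
    # alternative for a single 24 GB GPU (guess): load quantized weights
    # "HF_MODEL_QUANTIZE": "bitsandbytes",
}

model = HuggingFaceModel(role=role, image_uri=image_uri, env=env)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # guess: multi-GPU instance instead of ml.g5.4xlarge
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Hallo, wie geht es dir?"}))
```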

I also tried to deploy via an Inference Endpoint, but I always end up with the following error:

[screenshot of the failed endpoint, 2023-07-24 01:16]
The logs look like this:

...File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 729, in warmup\n raise RuntimeError(\nRuntimeError: Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`\n"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.751618Z","level":"ERROR","message":"Server error: Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_input_length":1024,"max_prefill_tokens":4096,"max_total_tokens":16000,"name":"warmup"},{"name":"warmup"}]} 2023/07/24 01:05:17 ~ Error: Warmup(Generation("Not enough memory to handle 16000 total tokens with 4096 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`")) 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.799769Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.799803Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"} 2023/07/24 01:05:17 ~ {"timestamp":"2023-07-23T23:05:17.982467Z","level":"INFO","fields":{"message":"Shard 0 terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]} 2023/07/24 01:05:17 ~ Error: WebserverFailed 2023/07/24 01:06:00 ~...

Any kind of help, tutorial, video, or AWS SageMaker snippet would be appreciated. I'm also happy to DM if possible.
Thank you!

Since I don't have the time to debug the AWS services, please contact https://www.primeline-systemhaus.de/.
Primeline is the main sponsor of my research and runs its own datacenters.
