CUDA out-of-memory issue when deploying mistralai/Mixtral-8x7B-Instruct-v0.1 on AWS "ml.g5.48xlarge"
Hello all,
I am a professional AI engineer. I am using the above LLM on SageMaker JumpStart, and responses take about 5 seconds on average even after enabling all 8 GPUs provided by "ml.g5.48xlarge". My requirement is to further reduce the response time, e.g. to less than a second.
For this purpose I planned to deploy mistralai/Mixtral-8x7B-Instruct-v0.1 with a custom inference.py file, setting the device to CUDA, on the same "ml.g5.48xlarge" EC2 instance on AWS. I am writing all the code with SageMaker.
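Roughly, the model loading in my inference.py follows the pattern below (a simplified sketch, not my exact code; the model id, dtype and handler name are illustrative). The point is that the device is simply set to CUDA:

```python
# inference.py -- simplified sketch of the loading path (illustrative, not the exact code)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative; in practice the weights come from model_dir

def model_fn(model_dir):
    """SageMaker PyTorch serving entry point: load the model once per worker."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    # "cuda" resolves to a single device (GPU 0), so the whole model is placed there.
    model.to("cuda")
    model.eval()
    return model, tokenizer
```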
Below is the error I am getting:
2024-02-14 T06:26:16,524 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 22.20 GiB total capacity; 1.88 GiB already allocated; 115.12 MiB free; 1.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Below are some options I tried (a condensed sketch of these attempts follows the list):
- Setting PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb, e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:24"; I also tried max_split_size_mb = 64, 128, 512 and 1024
- Setting PYTORCH_CUDA_ALLOC_CONF to the other memory-management options, e.g. "heuristic", as suggested in this blog post: https://iamholumeedey007.medium.com/memory-management-using-pytorch-cuda-alloc-conf-dabe7adec130
- Using torch.cuda.empty_cache() in the inference script, as suggested in https://community.databricks.com/t5/machine-learning/torch-cuda-outofmemoryerror-cuda-out-of-memory/td-p/9651
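For completeness, this is roughly how those attempts were combined in the script (a condensed sketch; the handler name and request format are illustrative). Note that the allocator setting is exported before torch initializes CUDA:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be in the environment before CUDA is initialized,
# otherwise the allocator ignores it. Also tried 24, 64, 512 and 1024.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

def predict_fn(data, model_and_tokenizer):
    """SageMaker PyTorch serving entry point: run one generation request."""
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    # Release cached allocator blocks between requests (per the Databricks thread);
    # this frees the allocator cache, not memory held by live tensors.
    torch.cuda.empty_cache()
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```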
Any help or references would be really appreciated. Looking forward to it. Thanks!
Same problem for me ...