CUDA out-of-memory issue when deploying mistralai/Mixtral-8x7B-Instruct-v0.1 on AWS "ml.g5.48xlarge"
Hello all,
I am a professional AI engineer. I am using the above LLM on SageMaker JumpStart, and responses take about 5 seconds on average even after enabling all 8 GPUs provided by "ml.g5.48xlarge". My requirement is to further reduce the response time, e.g. to less than a second.
For this purpose I planned to deploy mistralai/Mixtral-8x7B-Instruct-v0.1 with a custom inference.py file, setting the device to CUDA, on the same "ml.g5.48xlarge" EC2 instance on AWS. I am writing all the code with SageMaker.
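Roughly, the model loading in my inference.py follows the pattern below (a simplified sketch, not my exact code; the model id, dtype and handler name are illustrative). The point is that the device is simply set to CUDA:

```python
# inference.py -- simplified sketch of the loading path (illustrative, not the exact code)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative; in practice the weights come from model_dir

def model_fn(model_dir):
    """SageMaker PyTorch serving entry point: load the model once per worker."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    # "cuda" resolves to a single device (GPU 0), so the whole model is placed there.
    model.to("cuda")
    model.eval()
    return model, tokenizer
```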
Below is the error I am getting:
2024-02-14 T06:26:16,524 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 22.20 GiB total capacity; 1.88 GiB already allocated; 115.12 MiB free; 1.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Below are some options I tried (a condensed sketch of these attempts follows the list):
- Setting PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb, e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:24"; I also tried max_split_size_mb = 64, 128, 512 and 1024
- Setting PYTORCH_CUDA_ALLOC_CONF to the other memory-management options, e.g. "heuristic", as suggested in this blog post: https://iamholumeedey007.medium.com/memory-management-using-pytorch-cuda-alloc-conf-dabe7adec130
- Using torch.cuda.empty_cache() in the inference script, as suggested in https://community.databricks.com/t5/machine-learning/torch-cuda-outofmemoryerror-cuda-out-of-memory/td-p/9651
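For completeness, this is roughly how those attempts were combined in the script (a condensed sketch; the handler name and request format are illustrative). Note that the allocator setting is exported before torch initializes CUDA:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be in the environment before CUDA is initialized,
# otherwise the allocator ignores it. Also tried 24, 64, 512 and 1024.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

def predict_fn(data, model_and_tokenizer):
    """SageMaker PyTorch serving entry point: run one generation request."""
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    # Release cached allocator blocks between requests (per the Databricks thread);
    # this frees the allocator cache, not memory held by live tensors.
    torch.cuda.empty_cache()
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```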
Any help or references would be really appreciated. Looking forward to it. Thanks!
Same problem for me ...