What instance type do I need to deploy this as an Inference Endpoint?

#18
by clang-kodex - opened

I tried deploying this as an Inference Endpoint on a large GPU instance (4 x Nvidia Tesla T4). For any smaller instance I get the warning "Model may be too large for the selected Instance Size." However, on the large GPU instance, the build fails with this error message:

Endpoint error: Endpoint failed to start, reason: Endpoint failed. Check logs or documentation more for more information

And from the logs:

bp6wn 2023-05-02T13:32:59.547Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.76 GiB total capacity; 14.24 GiB already allocated; 23.75 MiB free; 14.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does this mean I need a larger instance (Nvidia A100)?
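
For context, here is a rough back-of-the-envelope check of whether the weights alone would fit. It is only a sketch: the parameter count below is a hypothetical placeholder (substitute the real figure from the model card), the per-GPU capacity is the 14.76 GiB reported in the log, and it ignores activations, KV cache, and framework overhead, so the real requirement will be higher. Note that the log only mentions GPU 0, which might mean the model is being loaded onto a single T4 rather than split across all four, but I have not confirmed that.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def fits(num_params: int, bytes_per_param: int,
         gpu_gib: float, num_gpus: int = 1) -> bool:
    """True if the weights fit in the combined memory of num_gpus GPUs.

    Assumes the weights can be sharded evenly across GPUs, which is
    optimistic and depends on how the endpoint loads the model.
    """
    return weights_gib(num_params, bytes_per_param) <= gpu_gib * num_gpus


T4_GIB = 14.76               # total capacity per GPU, taken from the log
NUM_PARAMS = 10_000_000_000  # hypothetical placeholder, not the real model size
FP16_BYTES = 2               # bytes per parameter in half precision

if __name__ == "__main__":
    need = weights_gib(NUM_PARAMS, FP16_BYTES)
    print(f"fp16 weights: {need:.2f} GiB")
    print("fits on 1 x T4:", fits(NUM_PARAMS, FP16_BYTES, T4_GIB, 1))
    print("fits on 4 x T4:", fits(NUM_PARAMS, FP16_BYTES, T4_GIB, 4))
```

If the weights fit only when sharded across all four GPUs, the question becomes whether the endpoint actually shards the model or whether a single larger GPU is required.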
