RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` issue

#8
by donnice849 - opened

I kept seeing this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I suspect it's caused by running out of memory, because I also see the following:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.79 GiB total capacity; 4.66 GiB already allocated; 40.94 MiB free; 4.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

which is also weird, because the 40.94 MiB reported as free should be enough for a 20.00 MiB allocation. Has anyone seen the same error?

Answering my own question: I found the post linked below and updated the device map so that one fewer layer is placed on the GPU. Now it works fine:
https://huggingface.co/facebook/galactica-6.7b/discussions/7#6390f3fcde25f9eda5714014
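
For anyone landing here with the same error, here is a minimal sketch of what "one fewer layer on the GPU" can look like. It assumes the device map is a plain dict from module name to device, which is the format the `device_map` argument of `from_pretrained` accepts. The module names (`model.layers.N` and so on) and the helper `offload_last_gpu_layer` are hypothetical and only for illustration, so check the names in your own model's device map.

```python
import re

def offload_last_gpu_layer(device_map, gpu=0, target="cpu"):
    """Return a copy of device_map with the highest-numbered layer on `gpu` moved to `target`."""
    # Assumption: decoder layers are named like "<prefix>.layers.<index>".
    layer_index = re.compile(r"\.layers\.(\d+)$")
    gpu_layers = [
        (int(m.group(1)), name)
        for name, device in device_map.items()
        if device == gpu and (m := layer_index.search(name))
    ]
    new_map = dict(device_map)  # leave the original map untouched
    if gpu_layers:
        _, last_name = max(gpu_layers)  # layer with the largest index on the GPU
        new_map[last_name] = target
    return new_map

# Hypothetical device map for a tiny 4-layer model (names are illustrative only).
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": "cpu",
    "model.norm": "cpu",
    "lm_head": "cpu",
}

new_device_map = offload_last_gpu_layer(device_map)
print(new_device_map)
```

The resulting dict can then be passed as `device_map=new_device_map` when loading the model with `from_pretrained`. Moving a layer off the GPU leaves some headroom for activations and the cuBLAS workspace, at the cost of slower inference for the offloaded layer.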

donnice849 changed discussion status to closed