CUDA error when trying to Pre-Train

#4 · opened by Saptarshi7

Hi, when I try to run luke-base for masked language modelling, I'm continuously getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx`

I've tried updating transformers, torch, and accelerate, and reducing the batch size, but nothing seems to work. Could you tell me what the issue could be?

I also tried switching from AutoModelForMaskedLM to LukeForMaskedLM, but I'm guessing we can't pretrain without entity tokens, correct?

Studio Ousia org

The LUKE model should work without entity inputs.
The error above seems to be related to CUDA issues.
Is it possible that a software/hardware mismatch is causing this error, e.g. a PyTorch build compiled against a CUDA version that your installed driver does not support?
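
For reference, here is a minimal sketch that checks both points: it prints the versions that have to line up, then runs a LukeForMaskedLM forward pass with no entity inputs at all. The studio-ousia/luke-base checkpoint is taken from the thread above; the example sentence and labels are only there to exercise the loss path.

```python
import torch
from transformers import LukeTokenizer, LukeForMaskedLM

# Versions that must be compatible: the CUDA toolkit the torch wheel was
# built against, and what the installed driver/GPU actually supports.
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Plain masked-LM forward pass; no entity_ids / entity_spans are passed.
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeForMaskedLM.from_pretrained("studio-ousia/luke-base").cuda()

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}

# Using the input ids as labels is only meant as a smoke test of the loss.
outputs = model(**inputs, labels=inputs["input_ids"])
print("MLM loss:", outputs.loss.item())
```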

Hi, thank you for responding. Unfortunately, I have tried changing everything, i.e. updating all related packages (transformers, accelerate, torch, etc.), but I still get the same error...

Studio Ousia org

Okay, I understand.
Does this issue persist with other models such as BertForMaskedLM?
If not, I'll investigate whether a particular operation in LUKE is causing the error.
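
As a sketch of that comparison (bert-base-uncased and the example sentence are placeholders): running the same masked-LM forward pass with BERT on the same GPU should show whether the failure is LUKE-specific or environment-wide.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Identical masked-LM forward pass with BERT; if this succeeds on the same
# GPU, the error is more likely tied to an operation specific to LUKE.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").cuda()

inputs = tok("Paris is the capital of [MASK].", return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
out = model(**inputs, labels=inputs["input_ids"])
print("loss:", out.loss.item())
```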

It appears that similar errors have been resolved by downgrading torch. It might be worth trying that approach.
https://discuss.pytorch.org/t/cuda-error-cublas-status-internal-error-when-calling-cublascreate-handle/114341
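
Before downgrading, it may also be worth checking whether a bare GPU matmul fails the same way; half-precision GEMMs typically go through cuBLAS routines such as cublasGemmEx, the call named in the error. A minimal sketch:

```python
import torch

# If even this bare fp16 matmul raises CUBLAS_STATUS_INTERNAL_ERROR, the
# problem is in the torch/CUDA installation rather than in the model code.
a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
print((a @ b).float().sum().item())
```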

Thank you for the response. I believe I tried that as well, i.e. running on torch 1.8, but the error persisted. I'll look at 1.7 though (which was mentioned in that link). And no, the problem did not occur for BERT, RoBERTa, or DistilBERT.

Studio Ousia org

I'm wondering where the error is occurring inside the model.
Do you observe any more specific errors when you set the environment variable CUDA_LAUNCH_BLOCKING=1?
(This sometimes gives you more detailed error messages.)
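
For reference, the variable can be exported in the shell (e.g. CUDA_LAUNCH_BLOCKING=1 python train.py, where train.py stands in for your actual script) or set at the very top of the script before any CUDA work, as in this sketch:

```python
import os

# Must be set before the first CUDA call: with synchronous kernel launches,
# the stack trace points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported only after the variable is set)
```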

Hello, sorry for the late reply. I'll try my best to get back to you with a more detailed error message.
