Apply for community grant: Academic project (gpu)

#2
by rayli - opened
Owner

Hi Hugging Face team,

Thanks for granting a ZeroGPU for this project. I have a question about ZeroGPU: sometimes a "GPU task aborted" error appears after the Space is restarted, and I need to rebuild the Space from scratch to resolve it. Whether the app works feels like a black box to me (I can't tell when the issue will appear and when it won't). Could you give me some insight into this?

Many thanks!
Ray

Owner

Screenshot 2024-03-23 at 22.01.13.png

Here, for example, the app worked a few times but failed soon after. I didn't change anything.

Hi @rayli Thanks for testing ZeroGPU!
We had an infra issue that caused this error at random. It should be fixed now, so what you saw might be related to it. But if the error persists after restarting your Space, there might be another cause.

The "GPU task aborted" error is raised when the function decorated with @spaces.GPU runs longer than the time specified with duration, so increasing the duration you set might avoid it.
It seems that you are explicitly moving your model to CUDA here, but this might increase the inference time of your function, so I think you should remove it. (You still need to call .to("cuda") when instantiating your model, though.) On ZeroGPU, CUDA is not available outside of the function decorated with @spaces.GPU. However, if you called .to("cuda") when instantiating the model, ZeroGPU remembers that the model needs to be on CUDA and moves it to the device when the function decorated with @spaces.GPU is called. The model is kept on the GPU for a while so it can process the next request quickly, but it is off-loaded to CPU if it goes unused for some time.
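
The pattern described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the code of this Space: DummyModel is a hypothetical stand-in for a real PyTorch model, predict is an illustrative name, the duration value of 120 is an arbitrary choice, and the try/except fallback exists only so the snippet also runs where the spaces package is not installed.

```python
try:
    import spaces  # provided on Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:
    # Assumption for illustration: a no-op replacement so the sketch
    # can be run outside a Space, where `spaces` is not installed.
    def gpu(duration=None):
        def decorator(fn):
            return fn
        return decorator


class DummyModel:
    """Hypothetical stand-in for a real torch model."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        # Record the requested device and return self, like torch's .to()
        self.device = device
        return self

    def __call__(self, prompt):
        return f"[{self.device}] {prompt}"


# Request CUDA once, when instantiating the model, outside the decorated function.
model = DummyModel().to("cuda")


@gpu(duration=120)  # assumed value; increase it if tasks get aborted
def predict(prompt):
    # No model.to("cuda") here: the move was already requested above.
    return model(prompt)
```

With a real model, the only lines that change are the model class and the body of predict; the single .to("cuda") at instantiation and the decorated inference function stay in the same places.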

As an example, you might want to take a look at the LLaVA-NeXT Space:
https://huggingface.co/spaces/merve/llava-next/blob/722a46407d42f7db83335ec1dd53281148fb1db7/app.py#L12-L13
https://huggingface.co/spaces/merve/llava-next/blob/722a46407d42f7db83335ec1dd53281148fb1db7/app.py#L29

@rayli I looked into the issue and opened a PR to fix the error. https://huggingface.co/spaces/rayli/DragAPart/discussions/3
Can you check and merge it? I've already tested this PR in a separate duplicate Space.

Owner

Thank you @hysts !
