About launch time out

#4
by hysts HF staff - opened
Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

@chris-rannou

We are now updating this Space to make the second stage model available, but downloading and loading the second stage model increases the launch time, and we are getting the following error:

Runtime error
launch timed out, space was not healthy after 30 min

Could you make the launch time limit longer?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited Jul 28, 2022

Also, how much host memory is available in this Space? Using the second stage model will increase the memory usage too. As for GPU memory, I checked that this app works with 24 GB of VRAM, so I think it's OK with an A10, but I'm not sure if it will work with the current amount of host memory.
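For anyone wanting to check the host memory limit from inside the app itself, here is a minimal sketch. It assumes a Linux container whose memory limit is exposed through the standard cgroup files; whether Spaces expose the limit this way is an assumption, not something confirmed in this thread. If no cgroup limit is readable, it falls back to the machine's physical memory.

```python
import os
from pathlib import Path

# Standard cgroup locations for the container memory limit (assumed to be
# readable inside the container; this is not confirmed for Spaces).
CGROUP_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def physical_memory_bytes() -> int:
    """Total physical RAM of the machine, in bytes."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def memory_limit_bytes() -> int:
    """Smallest of the physical RAM and any numeric cgroup limit found."""
    limit = physical_memory_bytes()
    for path in CGROUP_FILES:
        try:
            text = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next one
        # cgroup v2 reports the literal string "max" when there is no limit.
        if text.isdigit():
            limit = min(limit, int(text))
    return limit


print(f"usable host memory: {memory_limit_bytes() / 1e9:.1f} GB")
```

Taking the minimum matters because inside a container the physical memory figure usually describes the whole machine, not the share assigned to the container.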

@hysts

I increased the launch timeout, but you are right: the actual issue is an OOM. This Space is assigned 46GB of memory. How much memory do you think you need?
Is the high memory usage only at startup to load the model, or does it also consume a lot of memory at actual runtime?

I updated the error message to reflect the OOM and increased the memory for the Space to 64GB

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

@chris-rannou
Thanks a lot!

> How much memory do you think you need?

I tested this app on a GCP A100 instance with 85GB of RAM before pushing it, so 85GB is definitely sufficient, but I wasn't sure how much is actually necessary. It seems to be working with 64GB of host memory now, though. Thanks.

> Is the high memory usage only at startup to load the model, or does it also consume a lot of memory at actual runtime?

It consumes a lot of memory at runtime too. When I run the app on the instance mentioned above, it consumes about 40-50GB of memory.
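To separate startup usage from runtime usage, one option is to log the process's peak resident memory at a few points. This is only a sketch using Python's standard `resource` module (Unix only); the stage names and the `log_peak` helper are made up for illustration and are not part of the app.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Peak resident set size of this process so far, in GB (10**9 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1e9


def log_peak(stage: str) -> float:
    """Print and return the peak host memory seen up to this point."""
    value = peak_rss_gb()
    print(f"[{stage}] peak host memory so far: {value:.2f} GB")
    return value


# Hypothetical call sites: once after the models are loaded,
# and again after the first request has been served.
after_load = log_peak("after model load")
after_infer = log_peak("after first inference")
```

Because this is a high-water mark, the second value can only be equal to or larger than the first. If it grows noticeably after inference, the extra memory is being used at runtime rather than only while loading.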

hysts changed discussion status to closed

The Space seems to stabilize at around 54GB of memory, but with a few spikes that went beyond the 64GB limit.
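As a rough illustration of how "steady state plus spikes" can be read off a series of memory samples, here is a small sketch. The sample values are invented for the example and are not measurements from this Space; `summarize` is a hypothetical helper.

```python
from statistics import median


def summarize(samples_gb, limit_gb):
    """Return the median usage and the samples that exceed the limit."""
    steady = median(samples_gb)
    spikes = [s for s in samples_gb if s > limit_gb]
    return steady, spikes


# Invented sample values in GB, for illustration only.
samples = [53.8, 54.1, 54.0, 66.2, 54.3, 53.9, 65.1, 54.0]
steady, spikes = summarize(samples, limit_gb=64.0)
print(f"steady state ~{steady:.1f} GB, {len(spikes)} sample(s) above the limit")
```

The median is used instead of the mean so that a handful of spikes does not pull the steady-state estimate upward.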

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Thanks for the info. I was encountering CUDA OOM when I ran the app with a larger batch size, but now it's fixed and seems to be working. Thanks for your help.
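The thread does not say how the CUDA OOM was fixed. One common way to guard against it, shown here only as a sketch, is to retry with a smaller batch size when an out-of-memory error is raised. The `run_with_backoff` and `fake_run` names are hypothetical, and the built-in `MemoryError` stands in for the framework's OOM exception so the example runs without a GPU.

```python
def run_with_backoff(run_batch, batch_size, oom_errors=(MemoryError,)):
    """Call run_batch, halving the batch size each time it runs out of memory.

    Returns the batch size that worked together with the result.
    """
    while batch_size >= 1:
        try:
            return batch_size, run_batch(batch_size)
        except oom_errors:
            batch_size //= 2  # try again with half as many items
    raise RuntimeError("out of memory even with batch size 1")


def fake_run(batch_size):
    """Stand-in for the real generation call: fails above 4 items."""
    if batch_size > 4:
        raise MemoryError("simulated OOM")
    return [0] * batch_size


used, outputs = run_with_backoff(fake_run, 16)
print(f"succeeded with batch size {used}")
```

In a real PyTorch app, the exception type passed as `oom_errors` would be the framework's CUDA OOM error, and it may also be necessary to free cached GPU memory before retrying.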
