ZeroGPU error

#2
by cbensimon HF staff - opened

ZeroGPU has recently migrated its GPUs from a single A10G to half of an A100.
It looks like celldetection's GpuStats utility class is not compatible with MIG, which makes the Space crash on startup.

Owner

Thanks for letting me know! I removed it for now

ericup changed discussion status to closed

Nice! Out of curiosity, could GpuStats work with MIG? (I'm not exactly sure what it does, but some metrics might not be available with MIG compared to what is available with a full device.)

Owner

Yes, I've now added MIG support! I switched to NVIDIA's implementation of pynvml (nvidia-ml-py). However, utilization rates for MIG are still not available. Additionally, it's necessary to iterate over the MIG instances. Otherwise, pynvml tries to aggregate information for each device, which apparently requires certain privileges that I don't have.
Thanks again for bringing this up!
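
For anyone hitting the same issue, here is a minimal sketch of what iterating over the MIG instances could look like with nvidia-ml-py. It is not the actual celldetection code: the helper names `iter_query_handles` and `memory_stats` are made up for illustration, and the NVML module is passed in as an argument so the logic can be exercised without a GPU. It assumes that querying memory on a MIG instance handle is permitted in your environment, and it leaves out utilization rates since those are not available for MIG.

```python
def iter_query_handles(nvml, device_index):
    """Yield handles that are safe to query for one physical device.

    `nvml` is the pynvml module from nvidia-ml-py (passed in as an argument,
    an assumption of this sketch). If MIG is enabled, the MIG instance handles
    are yielded instead of the parent device handle.
    """
    device = nvml.nvmlDeviceGetHandleByIndex(device_index)
    try:
        # nvmlDeviceGetMigMode returns [current_mode, pending_mode]
        current_mode, _pending_mode = nvml.nvmlDeviceGetMigMode(device)
    except nvml.NVMLError:
        current_mode = None  # device does not support MIG
    if current_mode != nvml.NVML_DEVICE_MIG_ENABLE:
        yield device
        return
    for mig_index in range(nvml.nvmlDeviceGetMaxMigDeviceCount(device)):
        try:
            yield nvml.nvmlDeviceGetMigDeviceHandleByIndex(device, mig_index)
        except nvml.NVMLError:
            continue  # this MIG slot is not populated or not visible


def memory_stats(nvml):
    """Collect used/total memory in bytes for every queryable handle."""
    nvml.nvmlInit()
    try:
        stats = []
        for device_index in range(nvml.nvmlDeviceGetCount()):
            for handle in iter_query_handles(nvml, device_index):
                mem = nvml.nvmlDeviceGetMemoryInfo(handle)
                stats.append(
                    {"device": device_index, "used": mem.used, "total": mem.total}
                )
        return stats
    finally:
        nvml.nvmlShutdown()
```

With nvidia-ml-py installed and an NVIDIA driver present, this would be called as `import pynvml; memory_stats(pynvml)`.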
