Question about inference on CPU
#134
by XiangD-OSU · opened
The hosted inference API mentions "The model is loaded and running on Intel Xeon Ice Lake CPU." and the latency seems surprisingly low. Does anyone have pointers to where I could find more information about deploying to a CPU-only environment, or is the service actually still hosted on GPUs?
This service is hosted on GPUs. I don't think you'll be able to run inference at that scale on CPUs. @Narsil wrote a great blog post on how we deployed it: https://huggingface.co/blog/bloom-inference-optimization
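For anyone who still wants to experiment with CPU-only inference, here is a minimal sketch using the transformers library. The bigscience/bloom-560m checkpoint, the prompt, and the generation settings are illustrative assumptions, not what the hosted service uses; the full 176B BLOOM model will not run at useful latency on a single CPU machine.

```python
# Minimal CPU-only inference sketch (assumption: a small BLOOM checkpoint for illustration)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # assumed small checkpoint, not the hosted 176B model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # CPUs generally lack fast fp16/bf16 kernels
)
model.eval()

inputs = tokenizer("Deploying to a CPU-only environment is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```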
Also, thank you for noticing the GUI bug! It's being taken care of: https://github.com/huggingface/hub-docs/pull/477
christopher changed discussion status to closed