Type of hardware for inference

Opened by jilijeanlouis

Hi, and congrats on your work!
What type of hardware are you running inference on?

I'm running out of memory (OOM) on a 24 GB VRAM card, even with 4-bit quantization, using the 7B model.

I'm able to load the 7B model in 4-bit (it will require a fairly recent card, IMO) on an NVIDIA A10G, using 6.9 GB of VRAM (out of 24 GB).
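
For anyone hitting the same OOM, here's a minimal sketch of what 4-bit loading can look like with transformers + bitsandbytes. The model id is a placeholder for whatever checkpoint this Space actually uses, and VRAM usage will grow with context length during generation.

```python
# Minimal sketch: loading a ~7B causal LM in 4-bit on a single 24 GB GPU.
# The model id below is a placeholder; substitute the checkpoint this Space uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "org/some-7b-model"  # placeholder, not the actual repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute keeps generation fast on an A10G
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU
)

prompt = "What type of hardware are you running inference on?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```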

The token generation rate in this Space differs from the sample ipynb. What causes the difference? Is there some optimization done on the inference endpoint?

We are using https://github.com/huggingface/text-generation-inference to power the inference backend for this Space.
The model is currently sharded with tensor parallelism across 4x A10s.
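
For completeness, querying a TGI backend from Python can look something like the sketch below. The URL and generation parameters are placeholders; the multi-GPU tensor-parallel sharding is configured server-side when TGI is launched (the `--num-shard` launcher option, if I recall correctly), not by the client.

```python
# Minimal sketch: querying a running text-generation-inference server over HTTP.
# The URL is a placeholder; sharding is configured server-side at launch.
import requests

TGI_URL = "http://localhost:8080"  # placeholder for the actual backend endpoint

payload = {
    "inputs": "What type of hardware are you running inference on?",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
    },
}

resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```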
