3bit on single A100 good quality, poor speed

#3
by Gian-hf - opened

Hi, just want to provide a little feedback. I was able to run the 3-bit version on a single A100 80GB, and in my short tests the quality looks remarkably close to the full 16-bit version (tried on the HF Space); overall the model's quality looks very close to Llama-2-70B, as anticipated by the leaderboard. Inference speed is a bit poor (I get 8 tokens/sec, while with Llama-70B GPTQ I get 80 to 160 tokens/sec), but I believe that's expected.
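For reference, here is a minimal sketch of how a tokens/sec figure like that can be measured (the throughput helper is generic arithmetic; the generation part assumes the transformers library and a placeholder model id, not the exact checkpoint):

```python
import time

def tokens_per_sec(n_new_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: newly generated tokens divided by wall-clock seconds."""
    return n_new_tokens / elapsed_s

def measure_generation(model_id: str, prompt: str = "Hello",
                       max_new_tokens: int = 128) -> float:
    """Sketch: time a single generate() call with transformers.
    Requires a GPU and a real checkpoint; model_id is a placeholder."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    n_new = out.shape[-1] - inputs["input_ids"].shape[-1]
    return tokens_per_sec(n_new, elapsed)

# e.g. 80 new tokens in 10 s gives the 8 tok/s figure quoted above
print(tokens_per_sec(80, 10.0))  # -> 8.0
```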

Still, it's amazing that it's now possible to run models the size of GPT-3/3.5 on a single GPU.

Lastly, I believe fine-tuning with QLoRA is also possible on a single A100 80GB, but I haven't tried it yet.
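A sketch of what such a QLoRA setup might look like (the peft/bitsandbytes calls are real APIs, but the hyperparameters and target module names are illustrative assumptions, not tested settings; the helper just shows why LoRA is cheap in trainable parameters):

```python
def lora_params_per_layer(r: int, d_in: int, d_out: int) -> int:
    """LoRA adds two low-rank matrices A (r x d_in) and B (d_out x r),
    so the trainable parameters per adapted weight are r * (d_in + d_out)."""
    return r * (d_in + d_out)

def qlora_configs():
    """Sketch of a QLoRA setup: 4-bit base weights + LoRA adapters.
    Hyperparameters here are illustrative assumptions."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                       # base weights quantized to 4-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],     # assumed attention projections
        task_type="CAUSAL_LM",
    )
    return bnb, lora

# With hidden size 8192 and r=16, each adapted 8192x8192 projection adds only:
print(lora_params_per_layer(16, 8192, 8192))  # -> 262144
```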

Thank you for your work.

Have you been able to do this on multi-GPU configuration or only for a single GPU?

@Gian-hf What is your average GPU utilization? You can check it with nvtop.

If you are running the HF transformers implementation, GPU utilization will likely stay below 30%. I have only found Flash Attention able to drive the GPU at maximum efficiency.
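If it helps, recent versions of transformers let you request the Flash Attention 2 backend at load time (this assumes the flash-attn package is installed, a supported GPU, and a model that implements the backend; the helper below just bundles the load kwargs):

```python
def fa2_load_kwargs(dtype: str = "bfloat16") -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained() that select
    the Flash Attention 2 backend (recent transformers releases; needs flash-attn)."""
    return {
        "attn_implementation": "flash_attention_2",
        "torch_dtype": dtype,      # FA2 requires fp16/bf16 weights
        "device_map": "auto",
    }

# Usage sketch (model id is a placeholder):
#   model = AutoModelForCausalLM.from_pretrained("some/model", **fa2_load_kwargs())
print(fa2_load_kwargs()["attn_implementation"])  # -> flash_attention_2
```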

Have you been able to do this on multi-GPU configuration or only for a single GPU?

I didn't try that yet, but what would be the gain? In my experience multi-GPU doesn't speed up inference; it only helps if you also increase the model's precision (increasing the VRAM used).
