very slow inference speed

#1 by tunggad

Has anyone tried this model (TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ) on a local GPU? I tried the HF transformers example from the model card with the gptq-3bit-128g-actorder_True variant on a single RTX 3090, with CUDA 12.3, torch 2.1.2+cu121, auto-gptq 0.7.1, optimum 1.17.1, transformers 4.38.2.

It took more than 10 minutes to produce the text (2133 chars / 384 words, including the prompt). The whole time, VRAM consumption was about 22 GB and GPU load was constantly around 90%. Isn't that a bit too slow?
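For reference, this is roughly the loading pattern I used, following the model card (the prompt and sampling settings here are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ"

# Load the 3-bit GPTQ branch; device_map="auto" places the quantized weights on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    revision="gptq-3bit-128g-actorder_True",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

prompt = "Write a story about llamas."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```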

@tunggad Hugging Face transformers is not that fast, so speed will be pretty slow. Use ExLlama or ExLlamaV2 for faster inference. I would actually recommend using a 3bpw EXL2 quant of Mixtral and loading it with ExLlamaV2; you will get much better speed (around 50 tokens per second?). See the sketch below.
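Something like this, assuming you have downloaded a 3bpw EXL2 quant of the model locally (the directory path and sampling settings are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Path to a locally downloaded 3bpw EXL2 quant (placeholder path)
model_dir = "/models/Nous-Hermes-2-Mixtral-8x7B-DPO-3.0bpw-exl2"

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # loads layer by layer into available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "Write a story about llamas."
output = generator.generate_simple(prompt, settings, 512)
print(output)
```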
