very slow inference speed

#1 by tunggad

Has anyone tried this model (TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ) on a local GPU? I tried the HF transformers example from the model card with the gptq-3bit-128g-actorder_True variant on a single RTX 3090, with CUDA 12.3, torch 2.1.2+cu121, auto-gptq 0.7.1, optimum 1.17.1, transformers 4.38.2.

It took more than 10 minutes to produce the text (2133 chars / 384 words, including the prompt). The whole time, VRAM consumption was about 22 GB and GPU load was constantly around 90%. Isn't that a bit too slow?
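For reference, this is roughly the loading pattern I used, following the model card (the prompt and sampling settings here are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ"

# Load the 3-bit GPTQ branch; device_map="auto" places the quantized weights on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    revision="gptq-3bit-128g-actorder_True",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

prompt = "Write a story about llamas."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```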

@tunggad Hugging Face transformers is not that fast, so speed will be pretty slow. Use ExLlama or ExLlamaV2 for faster inference. I would actually recommend using a 3bpw EXL2 quant of Mixtral and loading it with ExLlamaV2; you will get much better speed (around 50 tokens per second?). See the sketch below.
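Something like this, assuming you have downloaded a 3bpw EXL2 quant of the model locally (the directory path and sampling settings are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Path to a locally downloaded 3bpw EXL2 quant (placeholder path)
model_dir = "/models/Nous-Hermes-2-Mixtral-8x7B-DPO-3.0bpw-exl2"

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # loads layer by layer into available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "Write a story about llamas."
output = generator.generate_simple(prompt, settings, 512)
print(output)
```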
