speed?

#2
by vince62s - opened

@ybelkada can you confirm you still need 2 GPUs to load this, and what kind of speed are you getting with plain Python code?
thanks
(I did the same and I'm getting 13 tok/sec on 2 GPUs (3090 + 4090), which is kinda slow.)
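For reference, a minimal sketch of how I measure tok/sec with plain transformers code, assuming an AWQ Mixtral checkpoint and autoawq installed (the model ID is a placeholder, and `device_map="auto"` shards the layers across both GPUs):

```python
# Minimal sketch: measure decode speed with plain transformers code.
# Assumes an AWQ Mixtral checkpoint and autoawq installed; the model ID
# below is a placeholder, not a specific published repo.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Mixtral-8x7B-Instruct-v0.1-AWQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard across the 3090 + 4090
)

inputs = tokenizer("Explain mixture-of-experts briefly.", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/sec")
```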

Hi @vince62s
How did you run inference? I am using this PR for now: https://github.com/huggingface/transformers/pull/27950, as the gate layers need to stay unquantized.
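For anyone following along, a hedged sketch of quantizing while leaving the MoE router ("gate") layers unquantized, using AutoAWQ's `quantize()` API; the `modules_to_not_convert` key is how recent AutoAWQ versions expose this, so double-check it against the version you use:

```python
# Hedged sketch: AWQ-quantize Mixtral while keeping the MoE router
# ("gate") layers in fp16, since they are sensitive to quantization.
# Config keys follow AutoAWQ's quantize() API; verify for your version.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": ["gate"],  # keep router weights unquantized
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mixtral-awq")
```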

I'm using this PR for now :) https://github.com/OpenNMT/OpenNMT-py/pull/2535/files
Using some llm-awq code to quantize (indeed skipping the gate layer) works fine, but it's slow. I'll try to implement QuIP 2-bit to make it fit on a single card.

What annoys me is that with Llama-2-70B-chat-AWQ I can get 16 tok/sec on the same 2 GPUs (3090 + 4090), and that model has more active params at inference.
Maybe the gating code is not optimized on my side (see the sketch below).

EDIT: I'm actually closer to 20 tok/sec on Mixtral AWQ.
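To illustrate where gating overhead can come from, here is a minimal sketch of Mixtral-style top-2 routing (names and shapes are illustrative, not the exact transformers implementation); the Python loop over experts launches many small kernels per layer per token, and `torch.where` forces a host sync on each iteration, which can dominate decode latency:

```python
# Minimal sketch of Mixtral-style top-2 expert routing in PyTorch.
# The per-expert Python loop runs 8 small matmuls per MoE layer per
# token, and torch.where syncs with the host each iteration.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    # x: (tokens, hidden); router: nn.Linear(hidden, num_experts)
    logits = router(x)                                      # (tokens, num_experts)
    weights, idx = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize top-k

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_rows, slots = torch.where(idx == e)           # tokens routed to expert e
        if token_rows.numel() == 0:
            continue
        out[token_rows] += weights[token_rows, slots, None] * expert(x[token_rows])
    return out
```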
