speed?

#2
by vince62s - opened

@ybelkada can you confirm you still need 2 GPUs to load this, and what kind of speed are you getting with plain Python code?
thanks
(I did the same and I'm getting 13 tok/sec on 2 GPUs (3090 + 4090), which is kinda slow.)
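For reference, a minimal sketch of how I measure tok/sec with plain transformers code, assuming an AWQ Mixtral checkpoint and autoawq installed (the model ID is a placeholder, and `device_map="auto"` shards the layers across both GPUs):

```python
# Minimal sketch: measure decode speed with plain transformers code.
# Assumes an AWQ Mixtral checkpoint and autoawq installed; the model ID
# below is a placeholder, not a specific published repo.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Mixtral-8x7B-Instruct-v0.1-AWQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard across the 3090 + 4090
)

inputs = tokenizer("Explain mixture-of-experts briefly.", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/sec")
```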

Hi @vince62s
How did you run inference? I am using this PR for now: https://github.com/huggingface/transformers/pull/27950, as the gate layers need to stay unquantized.
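For anyone following along, a hedged sketch of quantizing while leaving the MoE router ("gate") layers unquantized, using AutoAWQ's `quantize()` API; the `modules_to_not_convert` key is how recent AutoAWQ versions expose this, so double-check it against the version you use:

```python
# Hedged sketch: AWQ-quantize Mixtral while keeping the MoE router
# ("gate") layers in fp16, since they are sensitive to quantization.
# Config keys follow AutoAWQ's quantize() API; verify for your version.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": ["gate"],  # keep router weights unquantized
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mixtral-awq")
```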

I'm using this PR for now :) https://github.com/OpenNMT/OpenNMT-py/pull/2535/files
Using some llm-awq code to quantize (indeed skipping the gate layer) works fine, but it's slow. I'll try to implement QuIP 2-bit to make it fit on a single card.

What annoys me is that with Llama-2-70B-chat-AWQ I can get 16 tok/sec on the same 2 GPUs (3090 + 4090), and that model has more active params at inference.
Maybe the gating code is not optimized on my side (see the sketch below).

EDIT: I'm actually closer to 20 tok/sec on Mixtral AWQ.
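To illustrate where gating overhead can come from, here is a minimal sketch of Mixtral-style top-2 routing (names and shapes are illustrative, not the exact transformers implementation); the Python loop over experts launches many small kernels per layer per token, and `torch.where` forces a host sync on each iteration, which can dominate decode latency:

```python
# Minimal sketch of Mixtral-style top-2 expert routing in PyTorch.
# The per-expert Python loop runs 8 small matmuls per MoE layer per
# token, and torch.where syncs with the host each iteration.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    # x: (tokens, hidden); router: nn.Linear(hidden, num_experts)
    logits = router(x)                                      # (tokens, num_experts)
    weights, idx = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize top-k

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_rows, slots = torch.where(idx == e)           # tokens routed to expert e
        if token_rows.numel() == 0:
            continue
        out[token_rows] += weights[token_rows, slots, None] * expert(x[token_rows])
    return out
```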
