Inference time with TGI

#15
by jacktenyx

Thanks for posting this model. I was able to run inference with TGI on a single 40 GB A100 with the following command:

docker run \
    -p 8080:80 \
    -e GPTQ_BITS=4 \
    -e GPTQ_GROUPSIZE=1 \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-chat-GPTQ \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --quantize gptq \
    --sharded false
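
For reference, the endpoint can then be queried and roughly timed like this (the prompt and max_new_tokens are arbitrary; ms/token here is just the wall time divided by the number of generated tokens):

time curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 128}}'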

This generated a response at 225 ms/token. However, when running the unquantized model sharded across 4 A100s, I was able to get around 45 ms/token. Am I missing a config or environment variable that would improve the inference time, or is this expected behavior with this quantization?
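
For reference, the unquantized sharded run was roughly the same command with the GPTQ settings dropped and --num-shard 4 added, pointing at the original meta-llama/Llama-2-70b-chat-hf weights (the exact model id is an assumption on my part; the gated repo needs a HUGGING_FACE_HUB_TOKEN), something like:

docker run \
    -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --num-shard 4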

I get similar latency (>100 ms/token) even with half the input length and total tokens. It is even slower than quantizing with bitsandbytes-nf4 (~51 ms/token).
This is odd, as GPTQ is said to be faster for inference than bitsandbytes.
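For reference, the nf4 comparison used a command along these lines (the model id and single-GPU layout are assumptions on my part; the lengths are the halved values mentioned above):

docker run \
    -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --max-input-length 2048 \
    --max-total-tokens 4096 \
    --quantize bitsandbytes-nf4 \
    --sharded false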
...
