Inference time with TGI

#15
by jacktenyx

Thanks for posting this model. I was able to run inference with TGI on a single 40 GB A100 with the following command:

docker run \
    -p 8080:80 \
    -e GPTQ_BITS=4 \
    -e GPTQ_GROUPSIZE=1 \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-chat-GPTQ \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --quantize gptq \
    --sharded false
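
For reference, the endpoint can then be queried and roughly timed like this (the prompt and max_new_tokens are arbitrary; ms/token here is just the wall time divided by the number of generated tokens):

time curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 128}}'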

This generated a response at 225 ms/token. However, when running the unquantized model sharded across 4 A100s, I was able to get around 45 ms/token. Am I missing a config or environment variable that would improve the inference time, or is this expected behavior with this quantization?
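
For reference, the unquantized sharded run was roughly the same command with the GPTQ settings dropped and --num-shard 4 added, pointing at the original meta-llama/Llama-2-70b-chat-hf weights (the exact model id is an assumption on my part; the gated repo needs a HUGGING_FACE_HUB_TOKEN), something like:

docker run \
    -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --num-shard 4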

I get similar latency (>100 ms/token) even with half the input length and total tokens. It is even slower than quantizing with bitsandbytes-nf4 (~51 ms/token).
This is odd, as GPTQ is said to be faster for inference than bitsandbytes.
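For reference, the nf4 comparison used a command along these lines (the model id and single-GPU layout are assumptions on my part; the lengths are the halved values mentioned above):

docker run \
    -p 8080:80 \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    --gpus all \
    --shm-size 5g \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --max-input-length 2048 \
    --max-total-tokens 4096 \
    --quantize bitsandbytes-nf4 \
    --sharded false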
...
