Memory consumption much higher on multi-GPU setup
I have just deployed this model on a g5.12x AWS instance (4 A10G GPUs with 24 GB each) using this setting:
" GPTQ_BITS=4 GPTQ_GROUPSIZE=32 sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-70B-chat-GPTQ --num-shard 4 --quantize gptq --revision gptq-4bit-32g-actorder_True"
According to the documentation, it should take 40.66 GB, but I'm currently seeing about 17 GB of GPU memory used on each GPU, 68 GB in total.
Can someone explain the reason behind the higher GPU memory consumption?
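For reference, my rough back-of-the-envelope on the 40.66 GB figure (my own arithmetic and assumptions, not taken from the TGI docs):

```sh
# Rough arithmetic behind the ~40 GB figure (assuming ~70e9 params at 4 bits
# per packed weight, ignoring group-32 scales/zeros and fp16 embeddings):
echo "scale=1; 70 * 10^9 * 0.5 / 1024^3" | bc   # ~32.6 GiB of packed 4-bit weights
# The group-size-32 scales/zeros and the fp16 embedding/lm_head presumably
# account for most of the remaining gap up to the documented 40.66 GB.
# The 68 GB I observe also includes whatever each shard allocates at runtime
# (CUDA context, NCCL buffers, KV cache), which is what I'm asking about.
```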
If I'm guessing correctly, in my experience textgen-webui uses AutoGPTQ by default, with several techniques that increase VRAM usage for the sake of inference speed. Check out the "model" page of textgen-webui and the AutoGPTQ loader for details.
Even so, AutoGPTQ is a bit slower than the ExLLaMAv2_hf loader. With ExLLaMAv2_hf, I can confirm that on my local 2x3090 rig this model consumes about 21G/17G after several rounds, whereas my split is set to 21G/21G. Would you try that loader instead? There are startup arguments in the textgen-webui readme for switching these loaders; a rough example is below.
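Something along these lines works on my setup; the exact flag spellings and the local model folder name are assumptions from my install, so double-check them against the textgen-webui readme for your version:

```sh
# Launch textgen-webui with the ExLlamav2_HF loader instead of AutoGPTQ,
# splitting the model across two 24 GB GPUs (--gpu-split values are in GB per device).
python server.py \
  --model TheBloke_Llama-2-70B-chat-GPTQ \
  --loader exllamav2_hf \
  --gpu-split 21,21
```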