Does not seem to work with TGI sharding

#6 opened by nazrak-atlassian

Hey @TheBloke, my guess is that this isn't on you, but I'm curious whether you were able to get it working. Using the latest TGI on EC2 with a g5.12xlarge:
docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/dlami/nvme/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --sharded true --num-shard 4 --quantize gptq

This pulls TGI version 1.3. The image SHA is 4cb7c8ab86a48d5445ba2237044a3855d989d94d77224dd4d80bc469f962d2ca and it was pushed 2 days ago, so I assume it includes the latest hotfixes for 1.3.3.
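
For anyone reproducing this, the equivalent command pinned to a specific tag rather than latest would be the following (assuming a versioned tag such as 1.3.3 is published for this image; I haven't verified the exact tag name):

docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/dlami/nvme/data:/data ghcr.io/huggingface/text-generation-inference:1.3.3 --model-id TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --sharded true --num-shard 4 --quantize gptq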

I get an error to do with world size.

2023-12-17T06:28:28.002180Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
2023-12-17T06:28:30.812512Z  INFO text_generation_launcher: Using exllama kernels v1
2023-12-17T06:28:30.816449Z ERROR text_generation_launcher: Error when initializing model
...
    self.model = MixtralModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 749, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 750, in <listcomp>
    MixtralLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 684, in __init__
    self.self_attn = MixtralAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 226, in __init__
    self.o_proj = TensorParallelRowLinear.load(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 478, in load
    weight = weights.get_multi_weights_row(prefix, quantize=config.quantize)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 285, in get_multi_weights_row
    qzeros = self.get_sharded(f"{prefix}.qzeros", dim=0)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 118, in get_sharded
    size % world_size == 0
AssertionError: The choosen size 1 is not compatible with sharding on 4 shards

That's not the full traceback, but I think it covers the important points. As far as you know, is there any architectural limitation that would prevent the model from working when sharded?

Just noticed this in the model card (sorry, I missed it the first time; it was pretty far down):

Serving this model from Text Generation Inference (TGI)
Not currently supported for Mixtral models

Can you provide any clarity on what the blocker is here? I know TGI supports Mixtral at a base level, and I have deployed a non-quantized version.
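
For reference, that unquantized deployment was essentially the same docker run without the --quantize flag, something like the sketch below (the model id and shard count here are illustrative, not necessarily the exact invocation):

docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/dlami/nvme/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4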

I'm facing the same issue. Any suggestions?

Thanks in advance

@AbRds I've swapped to EETQ for quantization for now. I believe it's supported from TGI 1.3.2 onward.

Hi @nazrak-atlassian, I'm not familiar with EETQ. How does it work? Do I just have to pass the parameter eetq instead of gptq?
I've tried passing the eetq argument, but I receive another error:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

@AbRds Yes, you should be able to change the --quantize arg. EETQ is an improved in-place 8-bit quantization technique (supposedly better performing than bitsandbytes'), but I can't speak to it beyond that. It worked for me on TGI 1.3.3 with the following docker run:

docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/dlami/nvme/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mixtral-8x7B-v0.1 --sharded true --num-shard 4 --quantize eetq

My guess is that you are trying to use TheBloke's GPTQ model with eetq. It should be run on the original model, and it will perform the quantization on the fly. Note that it's only an 8-bit quantization, so if you need 4-bit for your VRAM budget it unfortunately won't work.
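
Once the container is up, a quick sanity check against TGI's /generate endpoint looks something like this (prompt and parameters are just placeholders):

curl http://localhost:8080/generate -X POST -H 'Content-Type: application/json' -d '{"inputs": "What is Mixtral?", "parameters": {"max_new_tokens": 32}}'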

@nazrak-atlassian you're right, I was trying to run TheBloke's version instead of the original one. Now it works perfectly.

Thanks a lot.

Any update on when GPTQ might work?

It seems that all of TheBloke's GPTQ models are still broken with TGI serving.
