Not supported for TGI > 1.3?

#1
by paulcx - opened

The response is an empty string.

I am also unable to launch this model with TGI in general. The version with ID 'ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ' works as expected, so I assume the problem is not specific to the Mixtral model type.

Ooh, nice @cdawg - that was a great tip. I just got that running well on runpod with this template.

@TheBloke - fyi, there seems to be something about this AWQ that is different to what Younes did. Separately I had issues quantizing AWQ myself (and I'm not sure how you or Younes got around that), issue here.

@cdawg I did download 'ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ' but I'm running into the same issue:

NotImplementedError: awq quantization is not supported for AutoModel

I'm running text-generation-launcher 1.3.4.

Would you kindly share your config?

You can try running with the "latest" TGI and it should work.

See this template for the config: https://runpod.io/gsc?template=546m57v73a&ref=jmfkcdio
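
If you'd rather not use the template, a plain Docker launch of a recent TGI looks roughly like this (the port mapping and the ybelkada model id are just example values, not the template's exact config):

```bash
# Pull the latest TGI image and serve the AWQ model (illustrative values)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantize awq
```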

Sorry, I didn't mention that I only have CPUs as part of an oc cluster.

@RonanMcGovern Did you get that running on CPUs?

I'm not sure if you can run AWQ on CPUs because AWQ requires Ampere GPUs.

Maybe there is a way, but I sense that running llama.cpp may be easier than using TGI. They have a server you can run as well.
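
As a rough sketch of the llama.cpp route, assuming you grab a GGUF quant of Mixtral (the filename is just an example, and the server binary's name differs between llama.cpp versions):

```bash
# Build llama.cpp, then start its built-in HTTP server (runs on CPU by default)
./server -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -c 4096 \
  --host 0.0.0.0 --port 8080
```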

@m-dahab I launch the model with text-generation-launcher --model-id=$MODEL_PATH --port=80 --max-best-of=1 --max-input-length=4096 --max-batch-prefill-tokens=4096 --max-total-tokens=9192 --json-output --validation-workers 4 --quantize=awq. But as @RonanMcGovern pointed out, I think you need a GPU with AWQ support. I think an A10, A100 or H100 would be good.
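
Same command broken across lines for readability ($MODEL_PATH is just a placeholder for wherever the weights live):

```bash
# Serve a local AWQ checkpoint with TGI
text-generation-launcher \
  --model-id=$MODEL_PATH \
  --port=80 \
  --max-best-of=1 \
  --max-input-length=4096 \
  --max-batch-prefill-tokens=4096 \
  --max-total-tokens=9192 \
  --json-output \
  --validation-workers 4 \
  --quantize=awq
```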

When I deploy this on SageMaker it all builds, but I always get an empty string response back. Anyone know why?

use the ybelkada version mentioned above

I have the same problem with that model too; only an empty string is returned (even though the response takes 10s to come back).

Have a look at that RunPod template and check the params. See the README as well for sample curl calls. Even if you don't use RunPod, it will give guidance.
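
As a rough example (the port and the [INST] prompt wrapping depend on how you've set things up), a TGI /generate call looks like:

```bash
# Minimal request against TGI's /generate endpoint
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "[INST] What is AWQ quantization? [/INST]", "parameters": {"max_new_tokens": 200}}'
```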

@RonanMcGovern thank you very much, that Runpod template works for me.

Do you also have a template for deploying Mixtral AWQ using vLLM by any chance?

Yup, there's a vLLM one here: https://github.com/TrelisResearch/one-click-llms
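
If you want to launch it by hand instead, vLLM's OpenAI-compatible server starts roughly like this (model id, context length and port here are illustrative, not the template's exact settings):

```bash
# Serve the AWQ Mixtral behind vLLM's OpenAI-compatible API (illustrative values)
python -m vllm.entrypoints.openai.api_server \
  --model ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
```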

@RonanMcGovern thanks a lot...

For the TGI one, any tips on why I keep running out of memory when using it?

You probably need at least 48 GB of VRAM, and that's with a short context length. You can try trimming the max context settings in the TGI params and it may help a bit. Otherwise, more VRAM is needed.
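
For example, something in this direction (the numbers are just a starting point to experiment with):

```bash
# Smaller context caps reduce the memory TGI reserves for prefill / KV cache
text-generation-launcher \
  --model-id=$MODEL_PATH \
  --quantize=awq \
  --max-input-length=2048 \
  --max-batch-prefill-tokens=2048 \
  --max-total-tokens=4096
```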

thanks @RonanMcGovern

your vLLM template works for me now, but I'm finding the response gets truncated and returned early with 'finish_reason': 'length'

this is despite me setting high values for --max_tokens and --max-model-len

do you have any idea why this might be?
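
For reference, this is roughly the shape of the request I'm sending (model name and prompt are placeholders):

```bash
# Completion request against the vLLM OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ",
        "prompt": "[INST] Write a detailed summary of AWQ quantization. [/INST]",
        "max_tokens": 2048
      }'
```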
