Not supported for TGI > 1.3?

#1
by paulcx - opened

The response is an empty string.

I am also unable to launch this model with TGI in general. The version with ID 'ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ' works as expected, so I assume the problem is not specific to the Mixtral model type.

Ooh, nice @cdawg - that was a great tip. I just got that running well on runpod with this template.

@TheBloke - fyi, there seems to be something about this AWQ that is different to what Younes did. Separately I had issues quantizing AWQ myself (and I'm not sure how you or Younes got around that), issue here.

@cdawg I did download 'ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ' but I'm running into the same issue:

NotImplementedError: awq quantization is not supported for AutoModel

I'm running text-generation-launcher 1.3.4.

Would you kindly share your config?

You can try running with the "latest" TGI and it should work.

See this template for the config: https://runpod.io/gsc?template=546m57v73a&ref=jmfkcdio
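
If you'd rather not use the template, a plain Docker launch of a recent TGI looks roughly like this (the port mapping and the ybelkada model id are just example values, not the template's exact config):

```bash
# Pull the latest TGI image and serve the AWQ model (illustrative values)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantize awq
```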

Sorry, I didn't mention that I only have CPUs as part of an oc cluster.

@RonanMcGovern Did you get that running on CPUs?

I'm not sure if you can run AWQ on CPUs because AWQ requires Ampere GPUs.

Maybe there is a way, but I sense that running llama.cpp may be easier than using TGI. They have a server you can run as well.
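
As a rough sketch of the llama.cpp route, assuming you grab a GGUF quant of Mixtral (the filename is just an example, and the server binary's name differs between llama.cpp versions):

```bash
# Build llama.cpp, then start its built-in HTTP server (runs on CPU by default)
./server -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -c 4096 \
  --host 0.0.0.0 --port 8080
```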

@m-dahab I launch the model with text-generation-launcher --model-id=$MODEL_PATH --port=80 --max-best-of=1 --max-input-length=4096 --max-batch-prefill-tokens=4096 --max-total-tokens=9192 --json-output --validation-workers 4 --quantize=awq. But as @RonanMcGovern pointed out, I think you need a GPU with AWQ support. I think an A10, A100 or H100 would be good.
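
Same command broken across lines for readability ($MODEL_PATH is just a placeholder for wherever the weights live):

```bash
# Serve a local AWQ checkpoint with TGI
text-generation-launcher \
  --model-id=$MODEL_PATH \
  --port=80 \
  --max-best-of=1 \
  --max-input-length=4096 \
  --max-batch-prefill-tokens=4096 \
  --max-total-tokens=9192 \
  --json-output \
  --validation-workers 4 \
  --quantize=awq
```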

When I deploy this on SageMaker it all builds, but I always get an empty string response back. Anyone know why?

use the ybelkada version mentioned above

I have the same problem with that model too; only an empty string is returned (even though the response takes 10s to come back).

Have a look at that RunPod template and check the params. See the README as well for sample curl calls. Even if you don't use RunPod, it will give guidance.
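
As a rough example (the port and the [INST] prompt wrapping depend on how you've set things up), a TGI /generate call looks like:

```bash
# Minimal request against TGI's /generate endpoint
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "[INST] What is AWQ quantization? [/INST]", "parameters": {"max_new_tokens": 200}}'
```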

@RonanMcGovern thank you very much, that Runpod template works for me.

Do you also have a template for deploying Mixtral AWQ using vLLM by any chance?

Yup, there's a vLLM one here: https://github.com/TrelisResearch/one-click-llms
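
If you want to launch it by hand instead, vLLM's OpenAI-compatible server starts roughly like this (model id, context length and port here are illustrative, not the template's exact settings):

```bash
# Serve the AWQ Mixtral behind vLLM's OpenAI-compatible API (illustrative values)
python -m vllm.entrypoints.openai.api_server \
  --model ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
```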

@RonanMcGovern thanks a lot...

For the TGI one, any tips on why I keep running out of memory when using it?

You probably need at least 48 GB of VRAM, and that's with a short context length. You can try trimming the max context settings in the TGI params and it may help a bit. Otherwise, more VRAM is needed.
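
For example, something in this direction (the numbers are just a starting point to experiment with):

```bash
# Smaller context caps reduce the memory TGI reserves for prefill / KV cache
text-generation-launcher \
  --model-id=$MODEL_PATH \
  --quantize=awq \
  --max-input-length=2048 \
  --max-batch-prefill-tokens=2048 \
  --max-total-tokens=4096
```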

thanks @RonanMcGovern

your vLLM template works for me now, but I'm finding the response gets truncated and returned early with 'finish_reason': 'length'

this is despite me setting high values for --max_tokens and --max-model-len

do you have any idea why this might be?
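
For reference, this is roughly the shape of the request I'm sending (model name and prompt are placeholders):

```bash
# Completion request against the vLLM OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "ybelkada/Mixtral-8x7B-Instruct-v0.1-AWQ",
        "prompt": "[INST] Write a detailed summary of AWQ quantization. [/INST]",
        "max_tokens": 2048
      }'
```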
