AWQ model in text-generation-webui

#1
by sdranju - opened

Hello,

It seems text-generation-webui does not support AWQ quantized models. Do you have any idea for a workaround?

Regards.
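
In the meantime, one possible workaround is to skip text-generation-webui and load the AWQ model directly with the AutoAWQ library from a short Python script. A minimal sketch, assuming AutoAWQ and transformers are installed and a CUDA GPU is available (the model repo here is just the one discussed later in this thread):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Mistral-7B-v0.1-AWQ"

# Load the AWQ-quantized weights; fuse_layers fuses attention/MLP layers for faster inference
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Tell me about AI"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Standard transformers-style generation on the quantized model
output = model.generate(tokens, do_sample=True, temperature=0.7, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))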

As discussed in the README, vLLM only supports Llama AWQ models at this time
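
For reference, a minimal sketch of serving a Llama AWQ model through vLLM's Python API, assuming a vLLM build with AWQ support (the repo name is just an example of a supported Llama AWQ model):

from vllm import LLM, SamplingParams

# quantization="awq" selects vLLM's AWQ kernels; float16 is used because the AWQ
# kernels don't run in bfloat16 (vLLM casts bfloat16 to float16, as the log below shows)
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq", dtype="float16")

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Tell me about AI"], sampling)
print(outputs[0].outputs[0].text)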

@TheBloke They don't, however, appear to support AWQ Mistral yet:

ValueError: Quantization is not supported for <class 'vllm.model_executor.models.mistral.MistralForCausalLM'>.

They should do: version 0.2 was pushed just a few hours ago with Mistral support listed. I updated my README a minute ago to say it now works.

Are you running 0.2?

I am running version 0.2.0; more accurately, I'm running from the source main branch, which I can see is tagged at 0.2.0.
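
(A quick way to confirm which vLLM build Python is actually importing, in case a source checkout and a pip-installed copy differ:

import vllm
print(vllm.__version__)

)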

Successfully built vllm
Installing collected packages: vllm
Successfully installed vllm-0.2.0
root@21cb0f50ccdf:/vllm# cd ..
root@21cb0f50ccdf:/# python -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16
WARNING 09-29 16:49:26 config.py:341] Casting torch.bfloat16 to torch.float16.
INFO 09-29 16:49:26 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-v0.1-AWQ', tokenizer='TheBloke/Mistral-7B-v0.1-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/api_server.py", line 74, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 140, in _init_workers
    self._run_workers(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 68, in init_model
    self.model = get_model(self.model_config)
  File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader.py", line 67, in get_model
    raise ValueError(
ValueError: Quantization is not supported for <class 'vllm.model_executor.models.mistral.MistralForCausalLM'>.

Oh, damn. I guess they just added unquantised support.

I'll remove mention of it from the README again!
