I've done the merge and made GGML and GPTQ quantisations for anyone interested


@TheBloke Really cool! The q4_0 model runs on an AMD Ryzen 9 3950X 16-core processor, takes about 20GB of memory, and generates a token every ~0.5 sec.
I am wondering if you could make a 4bit HF-compliant model so I can run it on 2 x RTX 3060 GPUs :D

I did make 4bit - that's the GPTQ model! I believe GPTQ supports splitting across two GPUs, though I've not done it personally.

HF doesn't natively support 4bit, only 16bit and 8bit (the latter via load_in_8bit=True, with bitsandbytes installed). But GPTQ models are for GPU inference, just like HF models.
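
For example, loading a float16 repo in 8bit looks roughly like this (just a sketch, assuming bitsandbytes and accelerate are installed; I'm using my fp16 repo as the model id):

# Rough sketch: load a float16 repo in 8bit with transformers + bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/OpenAssistant-SFT-7-Llama-30B-HF"  # any fp16 repo works here

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # accelerate places layers on the available GPU(s)
    load_in_8bit=True,   # 8bit weights via bitsandbytes
)

inputs = tokenizer("Hello!", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))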

You can look at text-generation-webui as a way to load GPTQ models with, I believe, multi-GPU splitting. Or, for Python inference, look at llama_inference.py in the GPTQ-for-LLaMa repo. The code is a bit complex and messy, but it should be possible to adapt it for general inference.

In the future, the way to use GPTQ will be a new and better repo called AutoGPTQ, which allows loading models in a way very similar to standard transformers/HF code. However, it definitely doesn't support multi-GPU yet, though that should be coming quite soon.
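
To give a rough idea of what that looks like (a sketch only - argument names can differ between AutoGPTQ versions, and you may need to download the repo locally and/or pass model_basename so it finds the checkpoint file):

# Rough sketch: GPTQ inference with AutoGPTQ (check your AutoGPTQ version's docs)
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    device="cuda:0",       # single GPU only - no multi-GPU split in AutoGPTQ yet
    use_safetensors=True,  # the 4bit weights are stored as .safetensors
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))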

Hi @TheBloke, have you managed to run these models with OpenAssistant's repo? I was hoping that the following would be sufficient:

diff --git a/oasst-shared/oasst_shared/model_configs.py b/oasst-shared/oasst_shared/model_configs.py
index 13b78e17..e062d411 100644
--- a/oasst-shared/oasst_shared/model_configs.py
+++ b/oasst-shared/oasst_shared/model_configs.py
@@ -124,6 +124,12 @@ MODEL_CONFIGS = {
         max_total_length=1792,  # seeing OOMs on 2048 on an A100 80GB
         quantized=True,
     ),
+    "OA_SFT_Llama_30Bq_7_Bloke": ModelConfig(
+        model_id="TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ",
+        max_input_length=1024,
+        max_total_length=1792,  # seeing OOMs on 2048 on an A100 80GB
+        quantized=True,
+    ),
     "OA_SFT_Llama_30B_7e3": ModelConfig(
         model_id="OpenAssistant/oasst-sft-7e3-llama-30b",
         max_input_length=1024,

but as I haven't added custom models there before, perhaps someone else has already tried this with the models provided here.

I did try to draw inspiration from some commits that previously added models there, like https://github.com/LAION-AI/Open-Assistant/commit/6c3519ba558e4f1aa75b859357e6ceb56eac0429, but it seems this is all they do. However, with the above diff I get:

open-assistant-inference-worker-1  | 2023-05-18 06:17:30.477 | WARNING  | __main__:main:28 - Model config: model_id='TheBloke/OpenAssistant-SFT-7-Llama-30B-GPTQ' max_input_length=1024 max_total_length=1792 quantized=True
open-assistant-inference-worker-1  | 2023-05-18T06:17:31.692411Z ERROR shard-manager: text_generation_launcher: "Error when initializing model
open-assistant-inference-worker-1  | Traceback (most recent call last):
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
open-assistant-inference-worker-1  |     sys.exit(app())
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
open-assistant-inference-worker-1  |     return get_command(self)(*args, **kwargs)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
open-assistant-inference-worker-1  |     return self.main(*args, **kwargs)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
open-assistant-inference-worker-1  |     return _main(
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
open-assistant-inference-worker-1  |     rv = self.invoke(ctx)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
open-assistant-inference-worker-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
open-assistant-inference-worker-1  |     return ctx.invoke(self.callback, **ctx.params)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
open-assistant-inference-worker-1  |     return __callback(*args, **kwargs)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
open-assistant-inference-worker-1  |     return callback(**use_params)  # type: ignore
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
open-assistant-inference-worker-1  |     server.serve(model_id, revision, sharded, quantize, uds_path)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 130, in serve
open-assistant-inference-worker-1  |     asyncio.run(serve_inner(model_id, revision, sharded, quantize))
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
open-assistant-inference-worker-1  |     return loop.run_until_complete(main)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
open-assistant-inference-worker-1  |     self.run_forever()
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
open-assistant-inference-worker-1  |     self._run_once()
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
open-assistant-inference-worker-1  |     handle._run()
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
open-assistant-inference-worker-1  |     self._context.run(self._callback, *self._args)
open-assistant-inference-worker-1  | > File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 99, in serve_inner
open-assistant-inference-worker-1  |     model = get_model(model_id, revision, sharded, quantize)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 52, in get_model
open-assistant-inference-worker-1  |     config = AutoConfig.from_pretrained(model_id, revision=revision)
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.27.0.dev0-py3.9.egg/transformers/models/auto/configuration_auto.py\", line 882, in from_pretrained
open-assistant-inference-worker-1  |     config_class = CONFIG_MAPPING[config_dict[\"model_type\"]]
open-assistant-inference-worker-1  |   File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.27.0.dev0-py3.9.egg/transformers/models/auto/configuration_auto.py\", line 588, in __getitem__
open-assistant-inference-worker-1  |     raise KeyError(key)
open-assistant-inference-worker-1  | KeyError: 'llama'
open-assistant-inference-worker-1  | " rank=0
open-assistant-inference-worker-1  | 2023-05-18T06:17:32.251412Z ERROR text_generation_launcher: Shard 0 failed to start:
open-assistant-inference-worker-1  | Traceback (most recent call last):
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/bin/text-generation-server", line 8, in <module>
open-assistant-inference-worker-1  |     sys.exit(app())
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py", line 55, in serve
open-assistant-inference-worker-1  |     server.serve(model_id, revision, sharded, quantize, uds_path)
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py", line 130, in serve
open-assistant-inference-worker-1  |     asyncio.run(serve_inner(model_id, revision, sharded, quantize))
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py", line 44, in run
open-assistant-inference-worker-1  |     return loop.run_until_complete(main)
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
open-assistant-inference-worker-1  |     return future.result()
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py", line 99, in serve_inner
open-assistant-inference-worker-1  |     model = get_model(model_id, revision, sharded, quantize)
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 52, in get_model
open-assistant-inference-worker-1  |     config = AutoConfig.from_pretrained(model_id, revision=revision)
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.27.0.dev0-py3.9.egg/transformers/models/auto/configuration_auto.py", line 882, in from_pretrained
open-assistant-inference-worker-1  |     config_class = CONFIG_MAPPING[config_dict["model_type"]]
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  |   File "/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/transformers-4.27.0.dev0-py3.9.egg/transformers/models/auto/configuration_auto.py", line 588, in __getitem__
open-assistant-inference-worker-1  |     raise KeyError(key)
open-assistant-inference-worker-1  | 
open-assistant-inference-worker-1  | KeyError: 'llama'

@pevogam I'm afraid you can't use GPTQ models with OpenAssistant's repo. That will only work with models that can be loaded natively with transformers. That means float16 repos, like my repo here: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-HF

But for that you will need a ton of VRAM - 60+ GB for fp16, or ~30GB for int8 if you use bitsandbytes (pass load_in_8bit=True to the from_pretrained() call).
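
If you do have enough combined VRAM across several cards, transformers/accelerate can shard the fp16 or int8 model for you - roughly like this (a sketch; the max_memory values are just examples for two 24GB cards, and I don't think the ~30GB int8 model will fit on 2 x 12GB 3060s):

# Rough sketch: shard the fp16 repo across two GPUs in int8 (needs ~30GB total VRAM)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/OpenAssistant-SFT-7-Llama-30B-HF",
    device_map="auto",                    # let accelerate split layers across GPUs
    load_in_8bit=True,                    # drop this for fp16 (60+ GB instead)
    max_memory={0: "22GiB", 1: "22GiB"},  # example per-GPU caps for two 24GB cards
)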

You could look into modifying the OpenAssistant code to use AutoGPTQ. That shouldn't be too hard if you know a bit of Python.

And/or wait for the new 4bit version of bitsandbytes to come out, which will support easy 4bit quantisation in native transformers. It's still in private beta and I've not tried it yet, so I can't comment on how well it works. But in theory it sounds like it will make it very easy to run float16 models in 4bit from any existing Python code.
