Thanks, Help needed!

#10
by gsaivinay - opened

Hello,

Thank you for your continued work providing these models.

Could you answer a question for me?

I'm looking to deploy this model as a backend API with streaming, accessed from a UI application. Which servers can I use for these GPTQ-converted models?

I'm currently using https://github.com/huggingface/text-generation-inference for regular HF models and it works well, but it doesn't support GPTQ yet.

Check out text-generation-webui. It can load these GPTQ models and provides a simple REST API with streaming support, which you can query from your own Python code.
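For the blocking (non-streaming) case, querying the webui's API extension can be sketched like this. The endpoint path, port, payload fields, and response shape are assumptions that depend on your webui version and launch flags (streaming typically goes through a separate websocket endpoint), so treat this as a starting point rather than a definitive client:

```python
import json
import urllib.request

# Assumed default address of text-generation-webui's API extension;
# adjust host/port to match your own setup.
API_URL = "http://localhost:5000/api/v1/generate"


def build_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Assemble a generation request in the shape the API extension expects.

    The field names here mirror common text-generation-webui parameters;
    check your webui version's API docs for the full list.
    """
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": True,
    }


def generate(prompt):
    """POST a prompt to the webui and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # Assumed response shape: {"results": [{"text": "..."}]}
    return result["results"][0]["text"]
```

With the webui running and its API enabled, `generate("Write a haiku about quantisation.")` would return the completion as a plain string.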

Or, if you want to implement your own code in future, keep an eye on AutoGPTQ. It provides a simple, transformers-like interface for loading GPTQ models, making it nearly as easy to load a GPTQ-quantised model as a standard HF model.
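A minimal loading sketch with AutoGPTQ might look like the following. The repo name is a hypothetical placeholder, and the exact keyword arguments (e.g. `use_safetensors`) may differ between AutoGPTQ releases, so check the version you install:

```python
# Sketch: loading a GPTQ model through AutoGPTQ's transformers-like interface.
# Requires a CUDA GPU and the auto-gptq package; the model name below is
# a hypothetical placeholder, not a specific recommendation.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/some-model-GPTQ"  # hypothetical repo name


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        MODEL,
        device="cuda:0",
        use_safetensors=True,
    )
    inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

After loading, the model behaves like a regular `transformers` causal LM, so your existing generation code should mostly carry over.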

It should be pretty easy to add GPTQ support to whatever code you have, including HF's text-generation-inference if you wanted to.

AutoGPTQ is still in active development and has a few bugs and issues, but it's making great progress and should be ready for mass adoption in a week or two.

That is awesome. Now I have multiple ideas for implementing a backend API. Thanks for your response.

gsaivinay changed discussion status to closed
