Thanks, Help needed!

#10
by gsaivinay - opened

Hello,

Thank you for your continued work providing these models.

Could you answer a question for me?

I'm looking to deploy this model as a backend API with streaming, accessed from a UI application. Which servers can I use for these GPTQ-converted models?

I'm currently using https://github.com/huggingface/text-generation-inference for regular HF models and it works well, but it doesn't support GPTQ yet.

Check out text-generation-webui. It can load these GPTQ models and provides a simple REST API with streaming support, which you can query from your own Python code.
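For the blocking (non-streaming) case, querying the webui's API extension can be sketched like this. The endpoint path, port, payload fields, and response shape are assumptions that depend on your webui version and launch flags (streaming typically goes through a separate websocket endpoint), so treat this as a starting point rather than a definitive client:

```python
import json
import urllib.request

# Assumed default address of text-generation-webui's API extension;
# adjust host/port to match your own setup.
API_URL = "http://localhost:5000/api/v1/generate"


def build_payload(prompt, max_new_tokens=200, temperature=0.7):
    """Assemble a generation request in the shape the API extension expects.

    The field names here mirror common text-generation-webui parameters;
    check your webui version's API docs for the full list.
    """
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": True,
    }


def generate(prompt):
    """POST a prompt to the webui and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # Assumed response shape: {"results": [{"text": "..."}]}
    return result["results"][0]["text"]
```

With the webui running and its API enabled, `generate("Write a haiku about quantisation.")` would return the completion as a plain string.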

Or, if you want to implement your own code in future, keep an eye on AutoGPTQ. It provides a simple, transformers-like interface for loading GPTQ models, making it nearly as easy to load a GPTQ-quantised model as a standard HF model.
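A minimal loading sketch with AutoGPTQ might look like the following. The repo name is a hypothetical placeholder, and the exact keyword arguments (e.g. `use_safetensors`) may differ between AutoGPTQ releases, so check the version you install:

```python
# Sketch: loading a GPTQ model through AutoGPTQ's transformers-like interface.
# Requires a CUDA GPU and the auto-gptq package; the model name below is
# a hypothetical placeholder, not a specific recommendation.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/some-model-GPTQ"  # hypothetical repo name


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        MODEL,
        device="cuda:0",
        use_safetensors=True,
    )
    inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

After loading, the model behaves like a regular `transformers` causal LM, so your existing generation code should mostly carry over.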

It should be pretty easy to add GPTQ support to whatever code you have, including HF's text-generation-inference if you wanted to.

AutoGPTQ is still in active development and has a few bugs and issues, but it's making great progress and should be ready for mass adoption in a week or two.

That is awesome. Now I have multiple ideas for implementing a backend API. Thanks for your response.

gsaivinay changed discussion status to closed
