How to configure with TGI?

#34
by luissimoes - opened

I am currently new to LLMs and their deployment models...
But I would like to know whether it's possible to use this model with TGI (text-generation-inference) so I can serve it behind an API.

Thank you

Not with this GGML variant, which is old and no longer recommended for anything. GGML has been superseded by GGUF, but GGUF is also not compatible with TGI. GGML/GGUF is a format designed for smaller hardware, including systems without a GPU, but it isn't supported by inference servers like TGI or vLLM. There are inference servers that do support GGUF, like text-generation-webui, but that is geared towards single-user operation rather than multi-user serving like TGI and vLLM.

Check out my Llama-2-7B-Chat-GPTQ or Llama-2-7B-Chat-AWQ repos; they both work with TGI.
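
As a rough sketch of what using one of those repos could look like: once a TGI server is running (for example via the TGI Docker image pointed at the GPTQ repo), you can call its `/generate` endpoint from Python. The host/port, container flags, and sampling values below are assumptions for a local setup and may differ depending on your TGI version:

```python
# Minimal sketch: query a running TGI server from Python.
# Assumes the server was started separately, e.g. with the TGI Docker image
# pointed at TheBloke/Llama-2-7B-Chat-GPTQ (with a GPTQ quantisation flag);
# the exact launch flags vary by TGI version.
import requests

TGI_URL = "http://localhost:8080/generate"  # assumed host/port for a local TGI instance

payload = {
    "inputs": "[INST] What is Text Generation Inference? [/INST]",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
    },
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```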

Thank you very much for the clarification.

Do those run on TGI using a CPU? Are they quantised? I want to keep the same quality as the 7B model.

All my uploads are quantised.

Yes, the GPTQ and AWQ models run on TGI, but they are GPU only; TGI itself is GPU only.

If you want CPU inference then you do want the GGUF models, but you can't use TGI. There are various options for providing an API on top of GGUF models, such as text-generation-webui's API; I'd recommend using that.
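
For the CPU route specifically, here is a minimal sketch using the ctransformers library to load a GGUF file locally. The repo and file names are illustrative assumptions; pick whichever quantised file suits your hardware:

```python
# Minimal sketch: run a GGUF chat model on CPU with ctransformers.
# The repo/file names below are illustrative; choose the quantisation you need.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",           # hub repo containing GGUF files
    model_file="llama-2-7b-chat.Q4_K_M.gguf",  # specific quantised file to load
    model_type="llama",
)

prompt = "[INST] Explain what GGUF is in one sentence. [/INST]"
print(llm(prompt, max_new_tokens=128, temperature=0.7))
```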

Thanks!

I am doing some experiments using the GGML version now.
But I'm struggling to understand why the approach using AutoModelForCausalLM from ctransformers gives a different result than using CTransformers from LangChain... same prompt, different results.

AutoModelForCausalLM is closer to what I am looking for; I just need to figure out what causes the difference...
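
One likely cause is that the two wrappers apply different default sampling parameters (temperature, top_k, top_p, seed), so outputs diverge even with the same prompt and model file. This is a hedged sketch that pins matching settings in both paths; the config keys and file names are assumptions based on how ctransformers and LangChain's CTransformers wrapper accept generation options:

```python
# Hedged sketch: make the two code paths comparable by pinning the same
# sampling settings. top_k=1 forces greedy decoding, so both paths should
# produce near-identical outputs for the same prompt, assuming the same
# model file is loaded underneath.
from ctransformers import AutoModelForCausalLM
from langchain.llms import CTransformers

MODEL_REPO = "TheBloke/Llama-2-7B-Chat-GGML"    # illustrative repo
MODEL_FILE = "llama-2-7b-chat.ggmlv3.q4_0.bin"  # illustrative file name

prompt = "[INST] What is quantisation? [/INST]"

# Direct ctransformers usage
llm_direct = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO, model_file=MODEL_FILE, model_type="llama"
)
out_direct = llm_direct(prompt, max_new_tokens=128, temperature=0.1, top_k=1)

# LangChain wrapper around the same backend, with matching config
llm_langchain = CTransformers(
    model=MODEL_REPO,
    model_file=MODEL_FILE,
    model_type="llama",
    config={"max_new_tokens": 128, "temperature": 0.1, "top_k": 1},
)
out_langchain = llm_langchain(prompt)

print(out_direct)
print(out_langchain)
```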
