How to configure with TGI?

#34
by luissimoes - opened

I am currently new to LLMs and their deployment models...
But I would like to know whether it's possible to use this model with TGI (text-generation-inference) so I can serve it behind an API.

Thank you

Not with this GGML variant, which is old and no longer recommended for anything. GGML has been superseded by GGUF, but GGUF is also not compatible with TGI. GGML/GGUF is a format designed for smaller hardware, including systems without a GPU, but it isn't supported by inference servers like TGI or vLLM. There are inference servers that do support GGUF, like text-generation-webui, but that is geared towards single-user operation rather than multi-user serving like TGI and vLLM.

Check out my Llama-2-7B-Chat-GPTQ or Llama-2-7B-Chat-AWQ repos; they both work with TGI.
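
As a rough sketch of what using one of those repos could look like: once a TGI server is running (for example via the TGI Docker image pointed at the GPTQ repo), you can call its `/generate` endpoint from Python. The host/port, container flags, and sampling values below are assumptions for a local setup and may differ depending on your TGI version:

```python
# Minimal sketch: query a running TGI server from Python.
# Assumes the server was started separately, e.g. with the TGI Docker image
# pointed at TheBloke/Llama-2-7B-Chat-GPTQ (with a GPTQ quantisation flag);
# the exact launch flags vary by TGI version.
import requests

TGI_URL = "http://localhost:8080/generate"  # assumed host/port for a local TGI instance

payload = {
    "inputs": "[INST] What is Text Generation Inference? [/INST]",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
    },
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```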

Thank you very much for the clarification.

Do those run on TGI using a CPU? Are they quantised? I want to keep the same quality as the 7B model.

All my uploads are quantised.

Yes, the GPTQ and AWQ models run on TGI, but they are GPU only; TGI itself is GPU only.

If you want CPU inference then you do want the GGUF models, but you can't use TGI. There are various options for providing an API on top of GGUF models, such as text-generation-webui's API; I'd recommend using that.
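
For the CPU route specifically, here is a minimal sketch using the ctransformers library to load a GGUF file locally. The repo and file names are illustrative assumptions; pick whichever quantised file suits your hardware:

```python
# Minimal sketch: run a GGUF chat model on CPU with ctransformers.
# The repo/file names below are illustrative; choose the quantisation you need.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",           # hub repo containing GGUF files
    model_file="llama-2-7b-chat.Q4_K_M.gguf",  # specific quantised file to load
    model_type="llama",
)

prompt = "[INST] Explain what GGUF is in one sentence. [/INST]"
print(llm(prompt, max_new_tokens=128, temperature=0.7))
```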

Thanks!

I am doing some experiments using the GGML version now.
But I'm struggling to understand why the approach using AutoModelForCausalLM from ctransformers gives a different result than using CTransformers from LangChain... same prompt, different results.

AutoModelForCausalLM is closer to what I am looking for; I just need to figure out what causes the difference...
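
One likely cause is that the two wrappers apply different default sampling parameters (temperature, top_k, top_p, seed), so outputs diverge even with the same prompt and model file. This is a hedged sketch that pins matching settings in both paths; the config keys and file names are assumptions based on how ctransformers and LangChain's CTransformers wrapper accept generation options:

```python
# Hedged sketch: make the two code paths comparable by pinning the same
# sampling settings. top_k=1 forces greedy decoding, so both paths should
# produce near-identical outputs for the same prompt, assuming the same
# model file is loaded underneath.
from ctransformers import AutoModelForCausalLM
from langchain.llms import CTransformers

MODEL_REPO = "TheBloke/Llama-2-7B-Chat-GGML"    # illustrative repo
MODEL_FILE = "llama-2-7b-chat.ggmlv3.q4_0.bin"  # illustrative file name

prompt = "[INST] What is quantisation? [/INST]"

# Direct ctransformers usage
llm_direct = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO, model_file=MODEL_FILE, model_type="llama"
)
out_direct = llm_direct(prompt, max_new_tokens=128, temperature=0.1, top_k=1)

# LangChain wrapper around the same backend, with matching config
llm_langchain = CTransformers(
    model=MODEL_REPO,
    model_file=MODEL_FILE,
    model_type="llama",
    config={"max_new_tokens": 128, "temperature": 0.1, "top_k": 1},
)
out_langchain = llm_langchain(prompt)

print(out_direct)
print(out_langchain)
```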
