Quick Tour
The easiest way of getting started is using the official Docker container. Install Docker following their installation instructions.
Launching TGI
Let’s say you want to deploy teknium/OpenHermes-2.5-Mistral-7B model with TGI on an Nvidia GPU. Here is an example on how to do that:
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.0.1 \
--model-id $model
If you want to serve gated or private models, please refer to this guide for detailed instructions.
Supported hardware
TGI supports various hardware. Make sure to check the Using TGI with Nvidia GPUs, Using TGI with AMD GPUs, Using TGI with Intel GPUs, Using TGI with Gaudi, Using TGI with Inferentia guides depending on which hardware you would like to deploy TGI on.
Consuming TGI
Once TGI is running, you can use the generate
endpoint or the Open AI Chat Completion API compatible Messages API by doing requests. To learn more about how to query the endpoints, check the Consuming TGI section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the endpoint.
import requests
headers = {
"Content-Type": "application/json",
}
data = {
'inputs': 'What is Deep Learning?',
'parameters': {
'max_new_tokens': 20,
},
}
response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
print(response.json())
# {'generated_text': '\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can'}
To see all possible deploy flags and options, you can use the --help
flag. It’s possible to configure the number of shards, quantization, generation parameters, and more.
docker run ghcr.io/huggingface/text-generation-inference:3.0.1 --help