Quick Tour
Text Embeddings
The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).
After making sure that your hardware is supported, install the NVIDIA Container Toolkit if you plan on utilizing GPUs. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
Next, install Docker following their installation instructions.
Finally, deploy your model. Let’s say you want to use BAAI/bge-large-en-v1.5. Here’s how you can do this:
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision
Here we pass revision=refs/pr/5 because the safetensors variant of this model is currently in a pull request.
We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.
Once you have deployed a model, you can use the embed endpoint by sending requests:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
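The same request can be issued from a script. The following is a minimal sketch using only the Python standard library, not an official client. The helper names build_embed_request and embed are hypothetical, and the response shape (a JSON array with one embedding per input) is an assumption to verify against your server.

```python
import json
import urllib.request

# Assumption: the container started above is reachable at this address.
TEI_URL = "http://127.0.0.1:8080"


def build_embed_request(text, base_url=TEI_URL):
    """Build the same POST request as the curl call above, without sending it."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url=TEI_URL, timeout=30):
    """Send the request and decode the JSON body.

    Assumption: the body is a JSON array holding one embedding
    (a list of floats) per input.
    """
    request = build_embed_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, embed("What is Deep Learning?") sends the request shown in the curl example. Building the request separately from sending it makes the payload easy to inspect without a live server.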
Re-rankers
Re-ranker models are Sequence Classification cross-encoder models with a single class that scores the similarity between a query and a text.
See this blogpost by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.
Let’s say you want to use BAAI/bge-reranker-large:
model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision
Once you have deployed a model, you can use the rerank endpoint to rank the similarity between a query and a list of texts:
curl 127.0.0.1:8080/rerank \
-X POST \
-d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
-H 'Content-Type: application/json'
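In a RAG pipeline you usually want the candidate texts back in ranked order. The sketch below builds the rerank request and maps the scores back onto the texts. It assumes each result is an object with an integer "index" into the submitted texts and a float "score"; that shape is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the re-ranker container started above is reachable at this address.
TEI_URL = "http://127.0.0.1:8080"


def build_rerank_request(query, texts, raw_scores=False, base_url=TEI_URL):
    """Build the same POST request as the curl call above, without sending it."""
    payload = json.dumps(
        {"query": query, "texts": texts, "raw_scores": raw_scores}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rank_texts(texts, results):
    """Pair each text with its score, highest score first.

    Assumption: every item in `results` looks like
    {"index": <position in texts>, "score": <float>}.
    """
    ordered = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(texts[item["index"]], item["score"]) for item in ordered]


def rerank(query, texts, base_url=TEI_URL, timeout=30):
    """Send the request and return (text, score) pairs, best match first."""
    request = build_rerank_request(query, texts, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.loads(response.read().decode("utf-8"))
    return rank_texts(texts, results)
```

Sorting on the client side keeps the helper correct whether or not the server already returns results in score order.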
Sequence Classification
You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model
Once you have deployed the model, you can use the predict endpoint to get the emotions most associated with an input:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":"I like you."}' \
-H 'Content-Type: application/json'
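To pick out the single most likely emotion in a script, the request can be built and the response reduced as in the sketch below. It assumes the response for one input is a JSON array of objects with a "label" string and a "score" float; verify that shape against your server. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the classification container started above is reachable here.
TEI_URL = "http://127.0.0.1:8080"


def build_predict_request(text, base_url=TEI_URL):
    """Build the same POST request as the curl call above, without sending it."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_label(predictions):
    """Return the label with the highest score.

    Assumption: `predictions` is a non-empty list of
    {"label": <str>, "score": <float>} objects.
    """
    return max(predictions, key=lambda item: item["score"])["label"]


def predict_top_label(text, base_url=TEI_URL, timeout=30):
    """Send the request and return the best-scoring label for `text`."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        predictions = json.loads(response.read().decode("utf-8"))
    return top_label(predictions)
```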
Batching
You can send multiple inputs in a single batched request. For example, for embeddings:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":["Today is a nice day", "I like you"]}' \
-H 'Content-Type: application/json'
And for Sequence Classification:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
-H 'Content-Type: application/json'
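When you have more inputs than you want to send in one request, you can split them into several batched payloads on the client. The sketch below shows one way to do that for the embed endpoint. The default batch size of 32 is an assumed value, not one taken from this guide; pick a size that matches your server's configuration. The helper names are hypothetical.

```python
import json


def chunked(inputs, batch_size):
    """Yield consecutive slices of `inputs`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size]


def build_batch_payloads(inputs, batch_size=32):
    """Return one JSON request body per batch, in the original input order.

    Each body has the same {"inputs": [...]} shape as the batched
    embed example above. The default of 32 is an assumption.
    """
    return [json.dumps({"inputs": batch}) for batch in chunked(inputs, batch_size)]
```

Each returned string can be sent as the body of a separate POST request, and because the batches preserve input order, concatenating the responses gives results aligned with the original list.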