text-embeddings-inference documentation

Quick Tour

Text Embeddings

The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).

After making sure that your hardware is supported, install the NVIDIA Container Toolkit if you plan to use GPUs. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

Next, install Docker by following its installation instructions.

Finally, deploy your model. Let’s say you want to use BAAI/bge-large-en-v1.5. Here’s how you can do this:

model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision

Here we pass revision=refs/pr/5 because the safetensors variant of this model is currently in a pull request. We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.

Once you have deployed a model, you can use the embed endpoint by sending requests:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
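The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the container started above is listening on 127.0.0.1:8080; the helper names build_embed_request and embed are illustrative and not part of TEI, and the response is assumed to be a JSON array holding one embedding vector per input.

```python
import json
import urllib.request

# Assumption: TEI is reachable at the address used in the curl example.
TEI_URL = "http://127.0.0.1:8080"


def build_embed_request(text, base_url=TEI_URL):
    """Build (but do not send) a POST request for the embed endpoint."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url=TEI_URL, timeout=30):
    """Send the request and decode the JSON response.

    Assumption: the body is a JSON array of embedding vectors.
    """
    request = build_embed_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, embed("What is Deep Learning?") should return the decoded response. Building the request is kept separate from sending it so that the payload can be inspected without a live server.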

Re-rankers

Re-ranker models are Sequence Classification cross-encoder models with a single class that score the similarity between a query and a text.

See this blog post by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

Let’s say you want to use BAAI/bge-reranker-large:

model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model --revision $revision

Once you have deployed a model, you can use the rerank endpoint to rank the similarity between a query and a list of texts:

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json'
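To use the result in a pipeline, you typically map the scores back to the original texts and sort them. The sketch below shows one way to do that in Python. It assumes the rerank response is a JSON array of objects with an index into the submitted texts and a score; the function names are illustrative, and the sample scores are made-up placeholders, not real model output.

```python
import json


def build_rerank_payload(query, texts, raw_scores=False):
    """Serialize the same JSON body that the curl example sends."""
    return json.dumps({"query": query, "texts": texts, "raw_scores": raw_scores})


def rank_texts(texts, results):
    """Pair each text with its score, best match first.

    Assumption: each result looks like {"index": int, "score": float},
    where "index" refers to a position in `texts`.
    """
    ordered = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(texts[item["index"]], item["score"]) for item in ordered]


texts = ["Deep Learning is not...", "Deep learning is..."]

# Hypothetical response with placeholder scores, for illustration only.
sample_results = [{"index": 0, "score": 0.02}, {"index": 1, "score": 0.98}]

ranked = rank_texts(texts, sample_results)
```

Sorting on the client keeps the code independent of the order in which the server returns its results.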

Sequence Classification

You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:

model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.2 --model-id $model

Once you have deployed the model, you can use the predict endpoint to get the emotions most associated with an input:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
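If you only need the most likely emotion, you can pick the highest-scoring entry from the response. This is a small Python sketch that assumes the predict response for a single input is a JSON array of objects with a label and a score; top_label is an illustrative helper, and the scores below are placeholders rather than real model output.

```python
def top_label(predictions):
    """Return the (label, score) pair with the highest score.

    Assumption: each prediction looks like {"label": str, "score": float}.
    """
    best = max(predictions, key=lambda item: item["score"])
    return best["label"], best["score"]


# Hypothetical response with placeholder scores, for illustration only.
sample_predictions = [
    {"label": "admiration", "score": 0.05},
    {"label": "love", "score": 0.9},
    {"label": "neutral", "score": 0.03},
]

label, score = top_label(sample_predictions)
```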

Batching

You can send multiple inputs in a batch. For example, for embeddings:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'
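When you have more inputs than you want to send in a single request, you can split them into several batches on the client. The Python sketch below does this by producing one JSON body per batch. The batch_size parameter is a client-side choice made for this example and not a TEI setting; the server may enforce its own limit on batch size.

```python
import json


def batched_embed_payloads(texts, batch_size):
    """Yield one JSON request body per batch of at most `batch_size` texts.

    `batch_size` is a hypothetical client-side setting for this sketch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield json.dumps({"inputs": texts[start:start + batch_size]})


payloads = list(
    batched_embed_payloads(["Today is a nice day", "I like you", "Hello"], 2)
)
```

Each payload can then be posted to the embed endpoint in the same way as the single batch above.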

And for Sequence Classification:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
    -H 'Content-Type: application/json'