text-embeddings-inference documentation

Text Embeddings Inference

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Text Embeddings Inference

Text Embeddings Inference (TEI) is a comprehensive toolkit designed for efficient deployment and serving of open source text embeddings models. It enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5.

TEI offers multiple features tailored to optimize the deployment process and enhance overall performance.

Key Features:

  • Streamlined Deployment: TEI eliminates the need for a model graph compilation step for a more efficient deployment process.
  • Efficient Resource Utilization: Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
  • Dynamic Batching: TEI incorporates token-based dynamic batching thus optimizing resource utilization during inference.
  • Optimized Inference: TEI leverages Flash Attention, Candle, and cuBLASLt by using optimized transformers code for inference.
  • Safetensors weight loading: TEI loads Safetensors weights to enable tensor parallelism.
  • Production-Ready: TEI supports distributed tracing through Open Telemetry and Prometheus metrics.


Benchmark for BAAI/bge-base-en-v1.5 on an NVIDIA A10 with a sequence length of 512 tokens:

Latency comparison for batch size of 1 Throughput comparison for batch size of 1

Latency comparison for batch size of 32 Throughput comparison for batch size of 32

Getting Started:

To start using TEI, check the Quick Tour guide.