TensorRT-LLM backend

The NVIDIA TensorRT-LLM (TRTLLM) backend is a high-performance backend for LLMs that uses NVIDIA’s TensorRT library for inference acceleration. It makes use of specific optimizations for NVIDIA GPUs, such as custom kernels.

To use the TRTLLM backend, you need to compile engines for the models you want to serve. Each engine must be compiled on the same GPU architecture that you will use for inference.
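
For example, you can check which architecture your GPUs report with nvidia-smi (the compute_cap query field assumes a reasonably recent NVIDIA driver):

# The machine you compile on and the machine you serve on must report the same compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv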

Supported models

Check the support matrix to see which models are supported.

Compiling engines

You can use Optimum-NVIDIA to compile engines for the models you want to use.

MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"

# Install the huggingface_hub package, which provides the huggingface-cli command
python -m pip install "huggingface_hub[hf_transfer]"

# Login to the Hugging Face Hub
huggingface-cli login

# Create a directory to store the model
mkdir -p /tmp/models/$MODEL_NAME

# Create a directory to store the compiled engine
mkdir -p /tmp/engines/$MODEL_NAME

# Download the model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --local-dir /tmp/models/$MODEL_NAME $MODEL_NAME

# Compile the engine using Optimum-NVIDIA
docker run \
  --rm \
  -it \
  --gpus=1 \
  -v /tmp/models/$MODEL_NAME:/model \
  -v /tmp/engines/$MODEL_NAME:/engine \
  huggingface/optimum-nvidia \
    optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=128 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engine \
    $MODEL_NAME

Your compiled engine will be saved in the /tmp/engines/$MODEL_NAME directory.
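
As a quick sanity check, you can list the output directory. The exact file layout depends on the Optimum-NVIDIA version, but it should contain the serialized engine file(s) along with their configuration:

ls -lh /tmp/engines/$MODEL_NAME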

Using the TRTLLM backend

Run the TGI-TRTLLM Docker image with the compiled engine:

docker run \
  --gpus 1 \
  -it \
  --rm \
  -p 3000:3000 \
  -e MODEL=$MODEL_NAME \
  -e PORT=3000 \
  -e HF_TOKEN='hf_XXX' \
  -v /tmp/engines/$MODEL_NAME:/data \
  ghcr.io/huggingface/text-generation-inference:latest-trtllm \
  --executor-worker executorWorker \
  --model-id /data/$MODEL_NAME
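
Once the server is up, you can send requests to TGI's standard /generate endpoint on the exposed port, for example:

curl http://localhost:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'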

Development

To develop the TRTLLM backend, you can use the dev containers located in the .devcontainer directory.
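
For example, you can reopen the repository in a dev container from VS Code ("Dev Containers: Reopen in Container"), or, if you have the Dev Containers CLI installed, start one from a terminal. This is a sketch of one possible workflow, not the only supported setup:

# From the repository root, build and start the dev container defined in .devcontainer
# (if several configurations exist, point --config at the one you want)
devcontainer up --workspace-folder .

# Open a shell inside the running dev container
devcontainer exec --workspace-folder . bash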
