Serving

Transformer models can be served for inference with specialized libraries such as Text Generation Inference (TGI) and vLLM. These libraries are specifically designed to optimize performance with LLMs and include many unique optimization features that may not be included in Transformers.

TGI

TGI can serve models that aren't natively implemented in TGI by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.

Refer to the Non-core model serving guide for more details.

Serve a Transformers implementation the same way you’d serve a TGI model.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
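
Once the server is running, you can send it requests, for example with curl. This is a minimal sketch; the prompt and generation parameters are placeholders, and the host port follows the -p 8080:80 mapping above.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'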

Add --trust-remote-code to the command to serve a custom Transformers model.

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code

vLLM

vLLM can also serve a Transformers implementation of a model if it isn’t natively implemented in vLLM.

Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.

Refer to the Transformers fallback section for more details.

By default, vLLM serves the native implementation, and if one doesn't exist, it falls back on the Transformers implementation. You can also set --model-impl transformers to explicitly use the Transformers implementation.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers
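
The server exposes an OpenAI-compatible API, by default on port 8000, so you can query it with curl. This is a minimal sketch; the prompt and max_tokens are placeholders.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a", "max_tokens": 20}'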

Add the --trust-remote-code flag to enable loading a model with remote code.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --trust-remote-code
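
Other vLLM features mentioned above, such as distributed inference and serving, also apply to the Transformers implementation. For example, tensor parallelism can be enabled with --tensor-parallel-size. This is a minimal sketch, assuming two GPUs are available.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
    --task generate \
    --model-impl transformers \
    --tensor-parallel-size 2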