Inference API

For detailed usage documentation, please refer to the Accelerated Inference API documentation.

What technology do you use to power the inference API?

For 🤗 Transformers models, the API is built on top of our Pipelines feature.

On top of Pipelines and depending on the model type, we build a number of production optimizations like:

  • compiling models to optimized intermediary representations (e.g. ONNX),
  • maintaining a Least Recently Used (LRU) cache, ensuring that the most popular models are always loaded,
  • scaling the underlying compute infrastructure on the fly depending on the load constraints.
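The LRU eviction policy in the second bullet can be sketched with a small cache. This is a toy illustration, not the service's actual code; the capacity and model names are arbitrary.

```python
from collections import OrderedDict

class ModelCache:
    """Toy Least Recently Used cache: recently requested models stay
    loaded; the stalest entry is evicted once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._loaded = OrderedDict()  # model_id -> loaded model object

    def get(self, model_id: str):
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)  # mark as most recently used
            return self._loaded[model_id]
        model = self._load(model_id)
        self._loaded[model_id] = model
        if len(self._loaded) > self.capacity:
            self._loaded.popitem(last=False)  # evict least recently used
        return model

    def _load(self, model_id: str):
        # Stand-in for actually loading weights into memory.
        return f"<model {model_id}>"

cache = ModelCache(capacity=2)
cache.get("bert-base-uncased")
cache.get("gpt2")
cache.get("bert-base-uncased")  # touch: now most recently used
cache.get("t5-small")           # evicts "gpt2", the stalest entry
```

The effect is that a popular model keeps being "touched" on every request and never reaches the cold end of the cache, while rarely used models make room for it.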

For models from other libraries, the API uses Starlette and runs in Docker containers. Each library defines the implementation of different pipelines.

How can I turn off the inference API for my model?

Specify inference: false in your model card's metadata.
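Concretely, the flag goes in the YAML front matter at the top of the model repo's README.md; the other field shown here is only an illustrative example of metadata that may already be present:

```yaml
---
license: apache-2.0   # illustrative existing metadata
inference: false      # disables the hosted Inference API for this model
---
```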

Can I send large volumes of requests? Can I get accelerated APIs?

If you are interested in accelerated inference, higher request volumes, or an SLA, please contact us at api-enterprise at

How can I see my usage?

You can head to the Inference API dashboard. Learn more about it in the Inference API documentation.

Is there programmatic access to the Inference API?

Yes, the huggingface_hub library includes a documented client wrapper for the Inference API.
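Underneath any client wrapper, a call is an authenticated HTTP POST to the hosted endpoint. The sketch below builds that request with only the standard library, assuming the documented endpoint layout; the model id and token are placeholders, and the network call itself is shown but not executed.

```python
import json
from urllib import request

API_ROOT = "https://api-inference.huggingface.co/models"

def model_url(model_id: str) -> str:
    """Build the hosted Inference API URL for a model repo."""
    return f"{API_ROOT}/{model_id}"

def query(model_id: str, payload: dict, token: str) -> dict:
    """POST a JSON payload to the Inference API (needs network + a valid token)."""
    req = request.Request(
        model_url(model_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (not run here): query("gpt2", {"inputs": "Hello"}, token="<your token>")
```

In practice, prefer the huggingface_hub client wrapper over hand-rolled requests: it handles authentication, retries, and task-specific payload shapes for you.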