Inference API

Please refer to the Inference API documentation for detailed information.

What technology do you use to power the inference API?

For 🤗 Transformers models, Pipelines power the API.

On top of Pipelines, and depending on the model type, several production optimizations are applied, such as:

  • compiling models to optimized intermediate representations (e.g. ONNX),
  • maintaining a Least Recently Used (LRU) cache, so that recently requested models, which tend to be the most popular ones, stay loaded,
  • scaling the underlying compute infrastructure on the fly depending on the load constraints.
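
To make the caching idea concrete, here is a minimal sketch of a Least Recently Used model cache. It is an illustration only, not the API's actual implementation: the class name `ModelCache`, the `capacity` parameter, and the `loader` callback are all hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Hypothetical LRU cache that keeps at most `capacity` models loaded."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # assumed callable: model_id -> loaded model
        self._models = OrderedDict()  # ordered from least to most recently used

    def get(self, model_id):
        if model_id in self._models:
            # Cache hit: mark the model as the most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: load the model, then evict the least recently used one
        # if the cache has grown past its capacity.
        model = self.loader(model_id)
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)
        return model

    def loaded(self):
        """Return the cached model ids, least recently used first."""
        return list(self._models)


# Stand-in loader for demonstration; a real one would load model weights.
cache = ModelCache(capacity=2, loader=lambda model_id: f"loaded:{model_id}")
for requested in ["a", "b", "a", "c"]:
    cache.get(requested)
print(cache.loaded())  # ['a', 'c']: "b" was evicted, the frequently used "a" stayed
```

With this policy, a model that is requested often keeps moving to the "recent" end of the cache, so it is rarely the one evicted.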

For models from other libraries, the API uses Starlette and runs in Docker containers. Each library defines its own implementation of the different pipelines.

How can I turn off the inference API for my model?

Specify inference: false in your model card’s metadata.
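
A model card's metadata lives in the YAML block between the `---` delimiters at the top of the repository's README.md. The following sketch shows one way to add the setting to that block using plain string handling; the function name `disable_inference` is hypothetical, and the code assumes a simple front matter layout rather than parsing YAML in full.

```python
def disable_inference(readme: str) -> str:
    """Return README text whose YAML front matter contains `inference: false`."""
    lines = readme.split("\n")
    rest = [line.strip() for line in lines[1:]]
    if lines[0].strip() == "---" and "---" in rest:
        end = 1 + rest.index("---")  # index of the closing delimiter
        body = []
        skipping = False
        for line in lines[1:end]:
            if line.startswith("inference:"):
                skipping = True  # drop any existing top-level `inference` key
                continue
            if skipping and (line.startswith((" ", "\t")) or not line.strip()):
                continue  # drop lines nested under the old `inference` key
            skipping = False
            body.append(line)
        body.append("inference: false")
        return "\n".join(["---", *body, "---", *lines[end + 1:]])
    # No front matter yet: create one in front of the existing text.
    return "---\ninference: false\n---\n" + readme


card = "---\nlicense: mit\n---\n# My model\n"
print(disable_inference(card))
```

The resulting metadata block then reads `license: mit` followed by `inference: false`, with the rest of the model card left unchanged.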

Can I send large volumes of requests? Can I get accelerated APIs?

If you are interested in accelerated inference, higher volumes of requests, or an SLA, please contact us at api-enterprise at huggingface.co.

How can I see my usage?

You can head to the Inference API dashboard. Learn more about it in the Inference API documentation.

Is there programmatic access to the Inference API?

Yes, the huggingface_hub library provides a client wrapper; see the huggingface_hub documentation for details.
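
As a rough sketch of what programmatic access can look like, the snippet below wraps the `InferenceClient` class from huggingface_hub in a small helper. The helper name `generate_text`, the `HF_TOKEN` environment variable lookup, and the choice of a text-generation task are assumptions made for illustration; which models and tasks are available depends on the Hub at the time of the call.

```python
import os


def generate_text(prompt, model_id, token=None):
    """Send `prompt` to a hosted text-generation model and return its output.

    `model_id` is any Hub model id that is currently served for text
    generation (an assumption the caller must verify).
    """
    # Imported lazily so this module still loads where huggingface_hub
    # is not installed.
    from huggingface_hub import InferenceClient

    # Falling back to HF_TOKEN is a convenience chosen for this sketch.
    client = InferenceClient(model=model_id, token=token or os.environ.get("HF_TOKEN"))
    return client.text_generation(prompt, max_new_tokens=20)
```

A call such as `generate_text("The Inference API is", model_id="gpt2")` would send one request over the network; `gpt2` is only a placeholder and may not be served at any given time.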