Please refer to the Inference API documentation for detailed information.
For 🤗 Transformers models, Pipelines power the API.
On top of Pipelines, and depending on the model type, several production optimizations are applied, such as:
- compiling models to optimized intermediary representations (e.g. ONNX),
- maintaining a Least Recently Used cache, ensuring that the most popular models are always loaded,
- scaling the underlying compute infrastructure on the fly depending on the load constraints.
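To illustrate the Least Recently Used idea from the list above, here is a minimal sketch in Python. It is not the Inference API's actual implementation: the `LRUModelCache` class, its `capacity` parameter, and the `loader` callback are hypothetical names chosen for this example. The cache keeps at most `capacity` models in memory and evicts the one that was requested longest ago.

```python
from collections import OrderedDict


class LRUModelCache:
    """Hypothetical in-memory cache that keeps the most recently used models loaded."""

    def __init__(self, capacity, loader):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # assumed callable: model_id -> loaded model object
        self._models = OrderedDict()  # ordered from least to most recently used

    def get(self, model_id):
        if model_id in self._models:
            # Cache hit: mark this model as the most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: load the model, then evict the least recently used one if over capacity.
        model = self.loader(model_id)
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)
        return model

    def loaded(self):
        """Return the cached model ids, least recently used first."""
        return list(self._models)


if __name__ == "__main__":
    # A stand-in loader; a real one would load model weights.
    cache = LRUModelCache(capacity=2, loader=lambda model_id: f"model:{model_id}")
    for requested in ["a", "b", "a", "c"]:
        cache.get(requested)
    print(cache.loaded())
```

With this policy, models that are requested often keep moving to the "recent" end of the cache, so they stay loaded while rarely used models are evicted first.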
To turn off the Inference API for your model, specify `inference: false` in your model card’s metadata.
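Model card metadata is written as a YAML block at the top of the repository's `README.md`. As a minimal sketch, a card that sets only this key would start like this (any other metadata keys your card already has stay alongside it):

```yaml
---
inference: false
---
```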
If you are interested in accelerated inference, higher volumes of requests, or an SLA, please contact us at
api-enterprise at huggingface.co.
The `huggingface_hub` library has a client wrapper documented here.