For detailed usage documentation, please refer to the Accelerated Inference API Documentation.
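The linked documentation covers the full request and response formats; as a minimal sketch, a hosted model can be queried over plain HTTP with a user access token (the model id `gpt2` and the prompt below are placeholders):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}  # your Hugging Face access token

def query(payload):
    # POST the inputs as JSON; the API returns the pipeline's output as JSON
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "The answer to the universe is"}))
```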
For 🤗 Transformers models, the API is built on top of our Pipelines feature.
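For reference, the corresponding local Pipelines call looks like this (a minimal sketch; the task and model are illustrative), with the hosted API producing the same kind of output server-side:

```python
from transformers import pipeline

# The API serves this same pipeline abstraction, hosted on our infrastructure.
generator = pipeline("text-generation", model="gpt2")
print(generator("The answer to the universe is", max_length=30))
```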
On top of Pipelines, and depending on the model type, we build a number of production optimizations such as:
- compiling models to optimized intermediate representations (e.g. ONNX),
- maintaining a Least Recently Used (LRU) cache so that the most popular models stay loaded (see the sketch after this list),
- scaling the underlying compute infrastructure on the fly depending on the load constraints.
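The serving code itself is not public, so purely as a toy illustration of the LRU idea, a cache of loaded pipelines could be sketched like this (the capacity and loading logic are simplified assumptions, not our actual infrastructure):

```python
from collections import OrderedDict
from transformers import pipeline

class PipelineLRUCache:
    """Toy LRU cache: keeps the N most recently requested models loaded."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()  # model_id -> loaded pipeline

    def get(self, model_id, task="text-generation"):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)  # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used model
            self.cache[model_id] = pipeline(task, model=model_id)
        return self.cache[model_id]

cache = PipelineLRUCache(capacity=2)
generator = cache.get("gpt2")
print(generator("Hello, world", max_length=20))
```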
You can disable the API for a given model by setting `inference: false` in your model card's metadata.
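For example, the metadata lives in the YAML section at the top of the model card's README.md:

```yaml
---
inference: false
---
```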
If you are interested in accelerated inference, higher request volumes, or an SLA, please contact us at api-enterprise@huggingface.co.
The huggingface_hub library includes a client wrapper for the API, documented here.
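As a brief sketch of that wrapper (per the huggingface_hub documentation; the model id and token are placeholders):

```python
from huggingface_hub.inference_api import InferenceApi

# Wraps the same HTTP endpoint shown above
inference = InferenceApi(repo_id="gpt2", token="YOUR_API_TOKEN")
print(inference(inputs="The answer to the universe is"))
```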