Inference API
Please refer to the Inference API documentation for detailed information.
What technology do you use to power the inference API?
For 🤗 Transformers models, Pipelines power the API. On top of Pipelines, and depending on the model type, there are several production optimizations, such as:
- compiling models to optimized intermediary representations (e.g. ONNX),
- maintaining a Least Recently Used cache, ensuring that the most popular models are always loaded,
- scaling the underlying compute infrastructure on the fly depending on the load constraints.
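To make the Least Recently Used idea from the list above concrete, here is a minimal sketch in Python. It is an illustration only, not the API's actual implementation: the `ModelCache` class, its `capacity` parameter, and the `loader` callback are hypothetical names chosen for this example.

```python
from collections import OrderedDict


class ModelCache:
    """Illustrative LRU cache keyed by model id (hypothetical, not the real service code)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity      # maximum number of models kept loaded
        self.loader = loader          # callable that loads a model given its id
        self._models = OrderedDict()  # ordered from least to most recently used

    def get(self, model_id):
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: load the model, then evict the least recently used one if over capacity.
        model = self.loader(model_id)
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)
        return model

    def loaded(self):
        """Return the ids of loaded models, least recently used first."""
        return list(self._models)
```

With a capacity of 2, requesting models `a`, `b`, `a`, `c` in that order evicts `b`, because `a` was used more recently. Models that keep receiving requests therefore stay loaded.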
For models from other libraries, the API uses Starlette and runs in Docker containers. Each library defines the implementation of different pipelines.
How can I turn off the inference API for my model?
Specify inference: false in your model card’s metadata.
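Model card metadata lives in the YAML block delimited by `---` lines at the top of the repository's README.md. As a rough sketch, the hypothetical helper below (`disable_inference` is not part of any library) adds the `inference: false` key to README text. It assumes any existing `inference` entry is a single-line key and does not handle nested values.

```python
def disable_inference(readme_text: str) -> str:
    """Return README text whose YAML front matter contains `inference: false`.

    Assumption: an existing `inference:` entry, if any, is a single-line key.
    """
    lines = readme_text.split("\n")
    rest = [line.strip() for line in lines[1:]]
    if lines[0].strip() == "---" and "---" in rest:
        end = 1 + rest.index("---")  # index of the closing `---` line
        # Drop any previous single-line `inference:` key, then append the new one.
        body = [line for line in lines[1:end] if not line.startswith("inference:")]
        front = ["---", *body, "inference: false", "---"]
        return "\n".join(front + lines[end + 1:])
    # No front matter found: create one at the top of the file.
    return "---\ninference: false\n---\n" + readme_text
```

For a README that starts with a front matter block containing only `license: mit`, the result keeps that key and adds `inference: false` on the line after it.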
Can I send large volumes of requests? Can I get accelerated APIs?
If you are interested in accelerated inference, higher volumes of requests, or an SLA, please contact us at api-enterprise at huggingface.co.
How can I see my usage?
You can head to the Inference API dashboard. Learn more about it in the Inference API documentation.
Is there programmatic access to the Inference API?
Yes, the huggingface_hub library has a client wrapper documented here.
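As a minimal sketch of programmatic access, the helper below assumes a recent version of `huggingface_hub` that exposes `InferenceClient` with a `text_classification` method. The `classify` function and the model id in the usage comment are placeholders for illustration. Check the client documentation for the tasks and parameters your installed version supports.

```python
def classify(text, model, client=None, token=None):
    """Send `text` to a text-classification model through the Inference API.

    `client` may be supplied by the caller (for example in tests); otherwise
    an InferenceClient from huggingface_hub is created for the given model.
    """
    if client is None:
        # Imported lazily so this module loads even without huggingface_hub installed.
        from huggingface_hub import InferenceClient

        client = InferenceClient(model=model, token=token)
    return client.text_classification(text)


# Example usage (requires network access; the model id is a placeholder):
# result = classify("I love this!", model="your-username/your-model")
# print(result)
```

Accepting an optional `client` keeps the helper easy to test offline with a stand-in object.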