Our paid Inference API is an accelerated version of the API that powers the inference widgets on every model's page (see the Model Hub docs). It is accelerated on CPU, with GPU available for enterprise users, and supports large volumes of requests.
Up to 10M tokens inference
Depending on your sequence lengths, this translates to up to 1M requests for text classification tasks, or 100k requests for generation tasks (translation, summarization).
Accelerated on CPU (2x faster than inference widgets)
Leveraging our pipelines built on optimized intermediate representations (e.g. ONNX) and carefully tuned executors.
Unlimited tokens inference
Use a scalable, dedicated endpoint, reserved just for your team.
We pick the best hardware for your models.
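As a minimal sketch of what one call to the hosted Inference API looks like, the snippet below builds an authenticated POST request using only the Python standard library. The model name, token value, and the `build_request`/`query` helper names are illustrative assumptions, not part of this document.

```python
import json
import urllib.request

# Placeholder model ID and endpoint pattern; substitute any model from the Hub.
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased"

def build_request(token: str, inputs: str) -> urllib.request.Request:
    """Assemble a POST request carrying the input text as a JSON payload."""
    payload = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",  # your API token (placeholder)
            "Content-Type": "application/json",
        },
        method="POST",
    )

def query(token: str, inputs: str):
    """Send the request and decode the JSON response from the API."""
    with urllib.request.urlopen(build_request(token, inputs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from the network call keeps the payload easy to inspect; the same request shape applies whether you use the shared accelerated API or a dedicated endpoint.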