Inference API Pricing

Our paid Inference API is an accelerated version of the API that powers the inference widgets on every model page (see the Model Hub documentation). It is accelerated on CPU, with GPU available for enterprise users, and supports large volumes of requests.

Read the API documentation ➡️

  • Up to 10M tokens inference

    Depending on your sequence lengths, this translates to up to 1M requests for text classification, or 100k requests for generation tasks such as translation and summarization.

  • Accelerated on CPU (2x faster than inference widgets)

    Leveraging our pipelines built on optimized intermediate representations (e.g. ONNX) and carefully tuned executors.
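As a minimal sketch of how a hosted inference call can be made over HTTP, the snippet below builds a POST request against the public `api-inference.huggingface.co` endpoint pattern. The model id and the `hf_xxx` token are placeholders, and the exact response format depends on the task:

```python
# Sketch: build a text-classification query for the hosted Inference API.
# The URL pattern and Bearer-token header follow the public convention;
# the model id and token below are placeholders, not real credentials.
import json
import urllib.request

API_URL = (
    "https://api-inference.huggingface.co/models/"
    "distilbert-base-uncased-finetuned-sst-2-english"
)

def build_request(text: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying the input text as a JSON payload."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )

req = build_request("I love this!", "hf_xxx")  # placeholder token
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` (or any HTTP client) returns the model's JSON prediction; each request consumes tokens from the plan's quota.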

Start now
  • Unlimited tokens inference

    Use a scalable, dedicated endpoint. Just for your team!

  • Private models

  • GPU hosting

    We pick the best hardware for your models.

  • Priority support

Contact us