Deploy Llama 3 in a few clicks on Inference Endpoints

Machine Learning At Your Service

With Inference Endpoints (dedicated), easily deploy Transformers, Diffusers or any model on dedicated, fully managed infrastructure. Keep your costs low with our secure, compliant and flexible production solution.

Production Inference Made Easy

Deploy models on dedicated and secure infrastructure without dealing with containers and GPUs

Deploy models with just a few clicks

Turn your models into production-ready APIs without having to deal with infrastructure or MLOps.

Keep your production costs down

Leverage a fully-managed production solution for inference and just pay as you go for the raw compute you use.

Enterprise Security

Deploy models into secure offline endpoints only accessible via a direct connection to your Virtual Private Cloud (VPC).

How It Works

Deploy models for production in a few simple steps

1. Select your model

Select the model you want to deploy. You can deploy a custom model or any of the 60,000+ Transformers, Diffusers or Sentence Transformers models available on the 🤗 Hub for NLP, computer vision, or speech tasks.


2. Choose your cloud

Pick your cloud and select a region close to your data in compliance with your requirements (e.g. Europe, North America or Asia Pacific).


3. Select your security level

Protected Endpoints are accessible from the Internet and require valid authentication.

Public Endpoints are accessible from the Internet and do not require authentication.

Private Endpoints are only available through an intra-region secured AWS or Azure PrivateLink direct connection to a VPC and are not accessible from the Internet.
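For a Protected endpoint, every request must carry a valid access token. A minimal sketch of building such a request (the endpoint URL and token are placeholders, and the `{"inputs": ...}` payload shape is an assumption for a typical text task):

```python
import json

# Placeholders: substitute your endpoint URL and Hugging Face access token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

def build_request(text: str) -> tuple[dict, bytes]:
    """Build the headers and JSON body for a single inference call."""
    headers = {
        # Protected endpoints require a valid token as a Bearer token.
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"inputs": text}).encode("utf-8")
    return headers, body

headers, body = build_request("I love this product!")
# With the `requests` library installed, you would then send:
#   requests.post(ENDPOINT_URL, headers=headers, data=body).json()
```

A Public endpoint accepts the same request without the `Authorization` header; a Private endpoint accepts it only from inside the connected VPC.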


4. Create and manage your endpoint

Click create and your new endpoint is ready in a couple of minutes. Define autoscaling, access logs and monitoring, set custom metrics routes, manage endpoints programmatically with the API/CLI, and roll back models, all with ease.


A Better Way to Go to Production

Scale your machine learning while keeping your costs low



Without Inference Endpoints:

  • Struggling with MLOps and building the right infrastructure for production.

  • Time wasted deploying models slows down ML development.

  • Deploying models in a compliant and secure way is difficult and time-consuming.

  • 87% of data science projects never make it into production.

With Inference Endpoints:

  • Don't worry about infrastructure or MLOps; spend more time building models.

  • A fully-managed solution for model inference accelerates your ML roadmap.

  • Easily deploy your models in a secure and compliant environment.

  • Seamless model deployment bridges the gap from research to production.

Customer Success Stories

Learn how leading AI teams use 🤗 Inference Endpoints to deploy models

Endpoints for Music


Musixmatch is the world’s leading music data company

Use Case

Custom text embeddings generation pipeline

Models Deployed



Custom model based on sentence transformers

The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It just took us a couple of hours to adapt our code, and have a functioning and totally custom endpoint.
Andrea Boscarino
Data Scientist at Musixmatch
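A custom interface like the one Musixmatch describes is typically packaged as a `handler.py` exposing an `EndpointHandler` class with `__init__` and `__call__`. The sketch below follows that convention; the injectable `encode_fn` is an assumption for illustration only, so the sketch runs without downloading model weights:

```python
# handler.py — sketch of a custom Inference Endpoints handler that returns
# text embeddings. On a real endpoint, only the EndpointHandler class with
# this __init__/__call__ shape is required.

class EndpointHandler:
    def __init__(self, path: str = "", encode_fn=None):
        if encode_fn is None:
            # On a real endpoint, load the Sentence Transformers model
            # shipped with the repository at `path`:
            from sentence_transformers import SentenceTransformer
            encode_fn = SentenceTransformer(path).encode
        self.encode = encode_fn

    def __call__(self, data: dict) -> list:
        # The request body arrives as a dict; "inputs" may be a single
        # string or a list of strings.
        inputs = data.get("inputs", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        # Return plain lists of floats so the response is JSON-serializable.
        return [[float(x) for x in vec] for vec in self.encode(inputs)]

# Stub encoder: "embeds" each text as its length, just to exercise the flow.
handler = EndpointHandler(encode_fn=lambda texts: [[len(t)] for t in texts])
print(handler({"inputs": ["hello", "hi"]}))  # [[5.0], [2.0]]
```

Committing a `handler.py` like this next to the model weights is what lets an endpoint serve a fully custom pipeline instead of the default task.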


Pay for CPU & GPU compute resources

🛠 Self-serve

  • Inference Endpoints (dedicated)

    Pay for compute resources uptime by the minute, billed monthly.

    As low as $0.03 per CPU core/hr and $0.50 per GPU/hr.

  • Email Support

    Email support and no SLAs.
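At the listed self-serve rates, a back-of-the-envelope bill is easy to check (the instance mix below is a made-up example, not a real plan):

```python
# Per-minute billing at the self-serve rates quoted above:
# $0.03 per CPU core/hr and $0.50 per GPU/hr.
CPU_CORE_PER_HR = 0.03
GPU_PER_HR = 0.50

def endpoint_cost(minutes_up: float, cpu_cores: int = 0, gpus: int = 0) -> float:
    """USD cost of an endpoint's uptime, prorated to the minute."""
    hours = minutes_up / 60
    return round(hours * (cpu_cores * CPU_CORE_PER_HR + gpus * GPU_PER_HR), 2)

# Example: one GPU plus 4 CPU cores, up for a full 30-day month (43,200 min).
print(endpoint_cost(43_200, cpu_cores=4, gpus=1))  # 446.4
```

Because billing is per minute of uptime, pausing or scaling an endpoint down directly reduces the bill.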

Deploy your first model

Enterprise

  • Inference Endpoints (dedicated)

    Custom pricing based on volume commit and annual contracts.

  • Dedicated Support & SLAs

    Dedicated support, 24/7 SLAs, and uptime guarantees.

Request a Quote

Start now with Inference Endpoints (dedicated)

Deploy models in a few clicks 🤯

Pay for compute resources uptime, by the minute.