Use 25k+ models via simple API calls

Over 25,000 state-of-the-art models deployed for inference via simple API calls, with up to 100x speedup and built-in scalability.

Join leading AI organizations already on Hugging Face: Google, Elastic, Salesforce, Writer, and Grammarly.

Plug & Play Machine Learning

Serve a wide variety of machine learning tasks in production

Natural Language Processing Tasks

Text generation, text classification, token classification, zero-shot classification, feature extraction, NER, translation, summarization, conversational, question answering, table question answering, text2text generation and fill mask.

Audio Tasks

Automatic speech recognition (ASR) and audio classification.

Computer Vision Tasks

Object detection and image segmentation.

How Does It Work?

State of the Art as easy as HTTP requests

import requests

def query(payload, model_id, api_token):
    # Authenticate with your personal API token.
    headers = {"Authorization": f"Bearer {api_token}"}
    API_URL = f"https://api-inference.huggingface.co/models/{model_id}"
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

model_id = "distilbert-base-uncased"
api_token = "api_XXXXXXXX"  # get yours at hf.co/settings/tokens
# The input text is sent under the "inputs" key of the JSON payload.
data = query({"inputs": "The goal of life is [MASK]."}, model_id, api_token)
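
A model that is not pinned may need to be loaded before it can answer, so the first request can fail. The sketch below wraps the request in a small retry loop. It assumes the API answers HTTP 503 while a model is loading and may include an `estimated_time` field (in seconds) in the JSON body; confirm this against the current API documentation before relying on it. The names `query_with_retry` and `max_retries`, and the injectable `post` and `sleep` parameters, are hypothetical and exist only for illustration and testing.

```python
import time


def query_with_retry(payload, model_id, api_token, max_retries=3,
                     post=None, sleep=time.sleep):
    """POST `payload` to a model, retrying while the model is loading.

    Assumption: a loading model answers HTTP 503 and may report an
    `estimated_time` (seconds) in its JSON body.
    """
    if post is None:
        import requests  # same HTTP library as the example above
        post = requests.post

    headers = {"Authorization": f"Bearer {api_token}"}
    api_url = f"https://api-inference.huggingface.co/models/{model_id}"

    for attempt in range(max_retries + 1):
        response = post(api_url, headers=headers, json=payload, timeout=30)
        # Stop on any non-503 answer, or once the retry budget is spent.
        if response.status_code != 503 or attempt == max_retries:
            break
        body = response.json()
        hint = body.get("estimated_time", 1.0) if isinstance(body, dict) else 1.0
        # Cap the wait so a bad hint cannot stall the caller.
        sleep(min(float(hint), 30.0))

    response.raise_for_status()  # surface any remaining HTTP error
    return response.json()
```

Called as `query_with_retry({"inputs": "The goal of life is [MASK]."}, model_id, api_token)`, it behaves like `query` above but waits between attempts instead of returning the loading error.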

Monitor usage and costs in your API dashboard

Fully hosted API for AI

Up and running in minutes

25,000+ state-of-the-art models

Instantly integrate AI models, deployed for inference via simple API calls.

Wide variety of machine learning tasks

We support a broad range of NLP, audio, and vision tasks, including sentiment analysis, text generation, speech recognition, object detection and more!

Production ready

We have built the most robust, secure and efficient AI infrastructure to handle production-level loads with unmatched performance and reliability.

Real-time inferences

We optimize and accelerate our models to serve predictions up to 100x faster, with the latency required for real-time applications.

Scalability

The Lab plan can support up to 1,000 requests per second. Need more? Enterprise plans offer dedicated resources for extra scalability.

SLAs

Production-level support and 24/7 SLAs are available through our enterprise plans.

Why Inference API?

Implement and iterate in no time

Leverage the largest and most diverse library of models for NLP, audio and computer vision to build machine-learning-powered applications in minutes.

Stay on the cutting edge of AI

Seamlessly upgrade to a new model so you're always up to date with the state of the art.

Focus on building

Stop worrying about infrastructure. We take care of models' performance and reliability at scale. Run models in milliseconds with just a few lines of code.

Let us do the machine learning

Harness the power of AI without taking on data science or MLOps work. The Inference API makes machine learning accessible to all engineering teams.

Pricing

Usage-based pricing that reflects your business needs

🧪 Lab Plan

Pay as you go
  • Accelerated Inference API

    Text tasks: $10 (CPU) or $50 (GPU) per million input characters

    Audio tasks: $0.0004 (CPU) or $0.002 (GPU) per second processed

  • Pin models for instant availability

    $1/day/model on CPU, $5/day/model on GPU

  • Support

    Email support and no SLAs

  • Infrastructure

    Shared resources, no auto-scaling, standard latency
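
To get a feel for the Lab plan rates listed above, here is a rough cost estimate in Python. It is a sketch, not a billing tool: the function name `estimate_lab_cost`, its parameters, and the 30-day default period are hypothetical, and it assumes all usage in one call runs on the same device type (CPU or GPU). Only the rates themselves come from the list above.

```python
# Lab plan rates copied from the pricing list above (USD).
TEXT_RATE_PER_MILLION_CHARS = {"cpu": 10.0, "gpu": 50.0}
AUDIO_RATE_PER_SECOND = {"cpu": 0.0004, "gpu": 0.002}
PIN_RATE_PER_DAY = {"cpu": 1.0, "gpu": 5.0}


def estimate_lab_cost(device, text_chars=0, audio_seconds=0.0,
                      pinned_models=0, days=30):
    """Return an estimated Lab plan cost in USD for one period.

    `days` only affects the pinned-model charge, which is billed per day.
    """
    if device not in TEXT_RATE_PER_MILLION_CHARS:
        raise ValueError("device must be 'cpu' or 'gpu'")
    text_cost = text_chars / 1_000_000 * TEXT_RATE_PER_MILLION_CHARS[device]
    audio_cost = audio_seconds * AUDIO_RATE_PER_SECOND[device]
    pin_cost = pinned_models * days * PIN_RATE_PER_DAY[device]
    return round(text_cost + audio_cost + pin_cost, 2)


# 2.5M input characters, one hour of audio, one model pinned for 30 days, on CPU:
# 25.00 (text) + 1.44 (audio) + 30.00 (pinning)
print(estimate_lab_cost("cpu", text_chars=2_500_000, audio_seconds=3600,
                        pinned_models=1, days=30))  # 56.44
```

The same workload on GPU comes to 125.00 + 7.20 + 150.00 = 282.20 USD under these rates.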

Enterprise Plan

Custom quote
  • Accelerated Inference API

    Custom pricing based on volume commit

    Starts at $2k/mo, annual contracts

  • Pin models for instant availability

    Custom pricing based on the number of pinned models

  • Support

    Production level support, 24/7 SLAs and uptime guarantees

  • Infrastructure

    Auto-scaling, dedicated resources to achieve desired latency, and support for large models

Frequently Asked Questions

What’s the latency?
We accelerate our models on CPU and GPU so your apps work faster. Read up on how we achieved 100x speedup on Transformers.
Is my data secure?
All data transfers are encrypted in transit with SSL. Hugging Face protects your inference data, with no third-party access. Enterprise plans offer additional layers of security for log-less requests.
What is your uptime?
Check out our status page to learn more about our uptime and follow status updates on any identified performance issues.
Do you offer SLAs?
For the Lab plan, there is no service-level agreement (SLA) on support response times. However, enterprise plans include an SLA on support response times and uptime guarantees.
Does it support large models?
Large models (>10 GB) require dedicated infrastructure and maintenance to work reliably. We can support them through an enterprise plan with a yearly commitment.
What’s your support email?
For customer support and general inquiries about Inference API, please contact us at api-enterprise@huggingface.co.