Let’s have a quick look at the 🤗 Accelerated Inference API.

Main features:

  • Leverage 10,000+ Transformer models (T5, Blenderbot, Bart, GPT-2, Pegasus...)
  • Upload, manage and serve your own models privately
  • Run Classification, NER, Conversational, Summarization, Translation, Question-Answering, Embeddings Extraction tasks
  • Get up to 10x inference speedup to reduce user latency
  • Accelerated inference for a number of supported models on CPU and GPU (GPU requires a Community Pro or Organization Lab plan)
  • Run large models that are challenging to deploy in production
  • Scale up to 1,000 requests per second with automatic scaling built-in
  • Ship new NLP, CV, Audio, or RL features faster as new models become available
  • Build your business on a platform powered by the reference open source project in ML

Get your API Token

To get started, log in to your Hugging Face account and get an API token from your account settings.

Your token should look like hf_xxxxx (older tokens look like api_XXXXXXXX or api_org_XXXXXXX).

If you do not submit your API token when sending requests to the API, you will not be able to run inference on your private models, or benefit from the model pinning and acceleration features of the API.

Running Inference with API Requests

The first step is to choose which model you are going to run. Go to the Model Hub and select the model you want to use. If you are unsure where to start, make sure to check our recommended models for each ML task available.


Let’s use gpt2 as an example. To run inference, simply use this code:

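import json
import requests

# The endpoint URL is missing from this page; set it to the inference URL of the model you chose.
API_URL = ""
# Replace the placeholder with your own API token.
API_TOKEN = "hf_xxxxx"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    # Serialize the payload to JSON and POST it to the model endpoint.
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    # Decode the JSON body of the response.
    return json.loads(response.content.decode("utf-8"))

data = query("Can you please let us know more details about your ")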

API Options and Parameters

Depending on the task (aka pipeline) the model is configured for, the request will accept specific parameters. When sending requests to run any model, API options let you specify the caching and model loading behavior, as well as inference on GPU (Community Pro or Organization Lab plan required). All API options and parameters are described in detailed_parameters.
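As a rough illustration of how task parameters and API options sit next to the input in a request body, here is a minimal sketch. The helper build_payload is hypothetical, and the names max_new_tokens, use_cache and wait_for_model are assumptions that do not appear on this page; check detailed_parameters for the names your task actually accepts.

```python
import json


def build_payload(inputs, parameters=None, **options):
    """Assemble a request body from inputs, task parameters and API options.

    Hypothetical helper: the API only sees the resulting JSON body.
    """
    payload = {"inputs": inputs}
    if parameters:
        payload["parameters"] = dict(parameters)
    if options:
        payload["options"] = dict(options)
    return payload


payload = build_payload(
    "Can you please let us know more details about your ",
    parameters={"max_new_tokens": 20},  # assumed task-specific parameter
    use_cache=False,                    # assumed option name for caching behavior
    wait_for_model=True,                # assumed option name for model loading behavior
)

# The JSON string below is what would be sent as the POST body.
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and posted in the same way as the plain string input in the gpt2 example.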

Using CPU-Accelerated Inference (up to ~10x speedup)

For API customers, the API token automatically enables CPU-Accelerated inference on requests if the model type is supported. For instance, if you compare gpt2 inference through our API with CPU-Acceleration against inference on the same model out of the box on a local setup, you should measure a ~10x speedup. The exact performance boost depends on the model, the input payload, and your local hardware.

To verify you are using the CPU-Accelerated version of a model, check the x-compute-type header returned with your requests, which should be cpu+optimized. If you do not see this value, not all optimizations are turned on. This can happen for several reasons: the model might have been added to transformers only recently, or the model can be optimized in several different ways and the best one depends on your use case.

If you contact us, we’ll be able to increase the inference speed for you, depending on your actual use case.
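The check described above could be sketched as follows. This assumes the x-compute-type header is read from the HTTP response; the helper names are hypothetical, and the sample headers are hand-written for illustration rather than taken from a real API response.

```python
def compute_type(headers):
    """Return the x-compute-type value from a header mapping, or None if absent.

    Header names are compared case-insensitively so a plain dict works too.
    """
    for name, value in headers.items():
        if name.lower() == "x-compute-type":
            return value
    return None


def is_cpu_accelerated(headers):
    """True when the headers report the CPU-Accelerated version of the model."""
    return compute_type(headers) == "cpu+optimized"


# Hand-written sample headers, not a real API response.
sample_headers = {
    "Content-Type": "application/json",
    "X-Compute-Type": "cpu+optimized",
}
print(is_cpu_accelerated(sample_headers))
```

With the requests library, the mapping to pass in would be response.headers.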

Using GPU-Accelerated Inference

In order to use GPU-Accelerated inference, you need a Community Pro or Organization Lab plan. To run any model on a GPU, you need to specify it via an option in your request:

{"inputs": "...REGULAR INPUT...", "options": {"use_gpu": true}}

Using GPU-Accelerated inference should produce a significant speedup for all models.

To verify you are using the GPU-Accelerated version of the model, check the x-compute-type header returned with your requests, which should be gpu.

Please note: Contact us to discuss your use case and usage profile when running GPU-Accelerated inference on many models or large models, so we can optimize the infrastructure accordingly.
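Putting the two GPU steps together, a minimal sketch might build the request body with the use_gpu option and then inspect the returned headers. The function names are hypothetical, and the sketch assumes the x-compute-type header is read from the HTTP response; no request is sent here.

```python
import json


def gpu_request_body(inputs):
    """Serialize a request body that asks for GPU inference via use_gpu."""
    return json.dumps({"inputs": inputs, "options": {"use_gpu": True}})


def ran_on_gpu(response_headers):
    """True when the response headers report x-compute-type: gpu."""
    # Lowercase the header names so the lookup is case-insensitive.
    normalized = {name.lower(): value for name, value in response_headers.items()}
    return normalized.get("x-compute-type") == "gpu"


body = gpu_request_body("...REGULAR INPUT...")
print(body)

# Hand-written headers for illustration, not a real API response.
print(ran_on_gpu({"X-Compute-Type": "gpu"}))
```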

Using Large Models (>10 GB)

To protect quality of service, large models are not loaded automatically. Contact us so we can configure large models for your endpoints.

Model Pinning / Preloading

With over 60,000 models available in the Model Hub, not all can be loaded in compute memory to be instantly available for inference. To guarantee model availability for API customers who integrate them in production applications, we offer to pin frequently used model(s) to their API endpoints, so these models are always instantly available for inference.

The number of models that can be pinned depends on the selected API plan. To get a model pinned to your account, please contact us.