Inference API documentation

Detailed usage and pinned models


API Usage dashboard

The API Usage dashboard (beta) shows the historical number of requests and input characters per model for an API Token.

Please note that each user account, and each organization, has its own API Token. By default, you should not need to do anything. However, if you have any doubt about what's being shown to you, or you have a complex setup (user subscription, multiple organizations, and so on), please contact api-enterprise@huggingface.co.

Pinned models

Model pinning is only supported for existing customers.

If you’re interested in having a model that you can readily deploy for inference, take a look at our Inference Endpoints solution! It is a secure production environment with dedicated and autoscaling infrastructure, and you have the flexibility to choose between CPU and GPU resources.

A pinned model is a model which is preloaded for inference and instantly available for requests authenticated with an API Token.

You can set pinned models for your API Token in the API Usage dashboard.


Model pinning is also accessible directly from the API. Here is how to see what your current pinned models are:

Python
import requests
api_url = "https://api-inference.huggingface.co/usage/pinned_models"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
response = requests.get(api_url, headers=headers)
# {"pinned_models": [...], "allowed_pinned_models": 5}
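As a small illustration of working with that response, here is a sketch of a helper that computes how many pin slots remain on a token. The helper name is hypothetical, and the response shape is assumed to match the example output above:

```python
# Hypothetical helper: how many more models this token can still pin.
# Assumes the response shape shown above:
# {"pinned_models": [...], "allowed_pinned_models": 5}

def remaining_pin_slots(usage: dict) -> int:
    """Return the number of unused pin slots for this API Token."""
    return usage["allowed_pinned_models"] - len(usage["pinned_models"])

# Example with a response like the one above:
usage = {
    "pinned_models": [{"model_id": "gpt2", "compute_type": "cpu"}],
    "allowed_pinned_models": 5,
}
print(remaining_pin_slots(usage))  # 4
```

In practice you would pass `response.json()` from the GET request above instead of the hard-coded dictionary.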

Pinning models is done as follows.

Be careful: you need to specify ALL the pinned models each time!

Python
import json
import requests
api_url = "https://api-inference.huggingface.co/usage/pinned_models"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
# XXX: Put ALL the models you want to pin at once, this will override
# the previous values.
data = json.dumps({"pinned_models": [{"model_id": "gpt2", "compute_type": "cpu", "replicas": 1}]})
response = requests.post(api_url, headers=headers, data=data)
# {"ok": "Pinned 1 models, please wait while we load them.", "pinned_models": [{"model_id": "gpt2", "compute_type": "cpu"}]}
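Because the POST overrides the previous values, a common pattern is to fetch the current pin list first and merge your change into it before posting. The sketch below shows one way to build that merged list; the helper name is hypothetical, and the entry shape ({"model_id", "compute_type", "replicas"}) is assumed from the example above:

```python
# Hypothetical helper: build the FULL pin list to POST, keeping existing
# pins and adding (or updating) one model. The API replaces the previous
# pin list with whatever you send, so the result must include everything.

def with_pinned_model(current, model_id, compute_type="cpu", replicas=1):
    """Return `current` plus one added/updated pin entry."""
    entry = {"model_id": model_id, "compute_type": compute_type, "replicas": replicas}
    # Drop any existing entry for the same model so it gets the new settings.
    kept = [p for p in current if p.get("model_id") != model_id]
    return kept + [entry]

# Example: `current` would come from a GET on the endpoint above.
current = [{"model_id": "gpt2", "compute_type": "cpu", "replicas": 1}]
new_pins = with_pinned_model(current, "distilbert-base-uncased")
# new_pins now contains both models, ready to be sent as
# json.dumps({"pinned_models": new_pins}) in the POST request above.
```

This keeps the "specify ALL the pinned models each time" requirement from silently unpinning models you still need.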