MoritzLaurer 
posted an update Sep 23
The new NIM Serverless API by HF and Nvidia is a great option if you want a reliable API for open-weight LLMs like Llama-3.1-405B that are too expensive to run on your own hardware.

- It's pay-as-you-go, so it doesn't have the rate limits of the standard HF Serverless API, and you don't need to commit to hardware as you would for a dedicated endpoint.
- It works out of the box with the new v0.25 release of our huggingface_hub.InferenceClient.
- It's specifically tailored to a small collection of popular open-weight models. For a broader selection of open models, we recommend using the standard HF Serverless API.
- Note that you need a token from an Enterprise Hub organization to use it.

Details in this blog post: https://huggingface.co/blog/inference-dgx-cloud
Compatible models in this HF collection: nvidia/nim-serverless-inference-api-66a3c6fcdcb5bbc6e975b508
Release notes with many more features of huggingface_hub==0.25.0: https://github.com/huggingface/huggingface_hub/releases/tag/v0.25.0

Copy-pasteable code in the first comment:
#!pip install "huggingface_hub>=0.25.0"
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="MY_FINEGRAINED_ENTERPRISE_ORG_TOKEN"  # see docs: https://huggingface.co/blog/inference-dgx-cloud#create-a-fine-grained-token
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    max_tokens=1024,
)

print(output)
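
If you'd rather print tokens as they arrive instead of waiting for the full response, here is a minimal sketch of a streaming variant. It assumes the NIM endpoint accepts `stream=True` in the same way as the standard chat completion route. The helper names `make_client` and `stream_reply` and the `HF_ENTERPRISE_TOKEN` environment variable are made up for illustration, not part of huggingface_hub.

```python
import os


def make_client(token):
    """Build a client for the NIM endpoint (same base_url as in the snippet above)."""
    # Imported lazily so the helper below can be used with any compatible client.
    from huggingface_hub import InferenceClient

    return InferenceClient(
        base_url="https://huggingface.co/api/integrations/dgx/v1",
        api_key=token,
    )


def stream_reply(client, model, prompt, max_tokens=256):
    """Yield the text pieces of a streamed chat completion (hypothetical helper)."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,  # assumption: the endpoint supports streaming
    )
    for chunk in stream:
        # Some chunks may carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


if __name__ == "__main__":
    # HF_ENTERPRISE_TOKEN is a placeholder name for your fine-grained org token.
    client = make_client(os.environ["HF_ENTERPRISE_TOKEN"])
    model = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
    for piece in stream_reply(client, model, "Count to 10"):
        print(piece, end="", flush=True)
    print()
```

Reading the token from an environment variable also keeps it out of notebooks and scripts you might share.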

Very exciting to see this! I often want to use an LLM for a short period, and setting up a whole endpoint for this can be overkill. This seems like a very neat solution!

Do you think there is a chance that any VLMs will be added soon?