Responses API (beta)

The Responses API (from OpenAI) provides a unified interface for model interactions with Hugging Face Inference Providers. Use your existing OpenAI SDKs to access features like multi-provider routing, event streaming, structured outputs, and Remote MCP tools.

This guide assumes you have a Hugging Face account and access token. You can create a free account at huggingface.co and get your token from your settings page.

Why build with the Responses API?

The Responses API provides a unified interface built for agentic apps. With it, you get:

  • Built-in tool orchestration. Invoke functions, server-side MCP tools, and schema-validated outputs without changing endpoints.
  • Event-driven streaming. Receive semantic events such as response.created, output_text.delta, and response.completed to power incremental UIs.
  • Reasoning controls and structured outputs. Dial up or down reasoning effort and require models to return schema-compliant JSON every time.

Prerequisites

  • A Hugging Face account with remaining Inference Providers credits (free tier available).
  • A fine-grained Hugging Face token with “Make calls to Inference Providers” permission, stored in the HF_TOKEN environment variable.

All Inference Providers chat completion models should be compatible with the Responses API. You can browse available models on the Inference Models page.

Configure your Responses client

Install the OpenAI SDK for your language of choice before running the snippets below (pip install openai for Python or npm install openai for Node.js). If you prefer issuing raw HTTP calls, any standard tool such as curl will work as well.

python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:groq",
    instructions="You are a helpful assistant.",
    input="Tell me a three-sentence bedtime story about a unicorn.",
)

print(response.output_text)

If you plan to use a specific provider, append it to the model id as <repo>:<provider> (for example moonshotai/Kimi-K2-Instruct-0905:groq). Otherwise, omit the suffix and let routing fall back to the default provider.
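
For example, the two requests below differ only in provider selection (a minimal sketch reusing the client configured above):

# Pin the request to Groq with the ":groq" suffix.
response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input="Hello!",
)

# No suffix: the router falls back to the default provider for this model.
response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905",
    input="Hello!",
)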

Core Response patterns

Plain text output

For a plain text response, pass a single string as input. The Responses API returns both the full response object and a convenience output_text helper.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    instructions="You are a helpful assistant.",
    input="Tell me a three sentence bedtime story about a unicorn.",
)

print(response.output_text)

Multimodal inputs

Mix text and vision content by passing a list of content parts. The Responses API unifies text and images into a single input array.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "what is in this image?"},
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)

Multi-turn conversations

Responses requests accept conversation history as a list of messages. Pass developer, system, and user messages to steer the assistant’s behavior; the full history travels with each request, so there is no server-side session to manage. A continuation sketch follows the example below.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input=[
        {"role": "developer", "content": "Talk like a pirate."},
        {"role": "user", "content": "Are semicolons optional in JavaScript?"},
    ],
)

print(response.output_text)
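
To continue the conversation, append the assistant’s reply to the input list and resend the full history. A minimal client-side sketch (the follow-up question is illustrative):

follow_up = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input=[
        {"role": "developer", "content": "Talk like a pirate."},
        {"role": "user", "content": "Are semicolons optional in JavaScript?"},
        # Feed the previous answer back as an assistant message...
        {"role": "assistant", "content": response.output_text},
        # ...then ask the next question.
        {"role": "user", "content": "Show me a one-line example."},
    ],
)

print(follow_up.output_text)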

Advanced features

All of the advanced features below use the same request format; only the parameters change.

Event-based streaming

Set stream=True to receive incremental response.* events. The SDK yields one typed event at a time (JSON over the wire), so you can render words as they arrive or monitor tool execution in real time.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

stream = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input=[{"role": "user", "content": "Say 'double bubble bath' ten times fast."}],
    stream=True,
)

for event in stream:
    print(event)
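
To render only the generated text, re-create the stream request above and filter for delta events instead of printing everything. A minimal sketch, assuming the typed event objects from the OpenAI SDK:

for event in stream:
    # Incremental text chunks arrive as response.output_text.delta events.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    # response.completed signals the end of the response.
    elif event.type == "response.completed":
        print()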

Tool calling and routing

Add a tools array to let the model call your functions. The model decides when a function is needed and emits a function_call item in its output; your code executes the function and returns the result in a follow-up request, as sketched after the example.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

tools = [
    {
        "type": "function",
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    }
]

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    tools=tools,
    input="What is the weather like in Boston today?",
    tool_choice="auto",
)

print(response)
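
The model does not run get_current_weather itself: it emits a function_call item with JSON-encoded arguments, and you return the result as a function_call_output item in a follow-up request. A minimal sketch of that round trip (the weather lookup is a stand-in for your own implementation):

import json

def get_current_weather(location, unit):
    # Stand-in implementation; call a real weather service here.
    return f"22 degrees {unit} and sunny in {location}"

for item in response.output:
    if item.type == "function_call":
        # Arguments arrive as a JSON string; decode before calling.
        args = json.loads(item.arguments)
        result = get_current_weather(**args)
        follow_up = client.responses.create(
            model="moonshotai/Kimi-K2-Instruct-0905:groq",
            tools=tools,
            input=[
                item,  # echo the model's function call back
                {
                    "type": "function_call_output",
                    "call_id": item.call_id,
                    "output": result,
                },
            ],
        )
        print(follow_up.output_text)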

Structured outputs

Force the model to return JSON matching a schema by supplying a text format. The Python SDK exposes a .parse helper that accepts a Pydantic model via text_format and converts the response directly into your target type.

When calling openai/gpt-oss-120b:groq from JavaScript or raw HTTP, include a brief instruction to return JSON. Without it, the model may emit markdown even when a schema is provided.

python
from openai import OpenAI
from pydantic import BaseModel
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="openai/gpt-oss-120b:groq",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    text_format=CalendarEvent,
)

print(response.output_parsed)
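
output_parsed is a regular Pydantic instance, so the extracted fields come back as typed attributes:

event = response.output_parsed  # a CalendarEvent instance
print(event.name, event.date, event.participants)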

Remote MCP execution

Remote MCP lets you call server-hosted tools that implement the Model Context Protocol. Provide the MCP server URL and allowed tools, and the Responses API handles the calls for you.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input="how does tiktoken work?",
    tools=[
        {
            "type": "mcp",
            "server_label": "gitmcp",
            "server_url": "https://gitmcp.io/openai/tiktoken",
            "allowed_tools": ["search_tiktoken_documentation", "fetch_tiktoken_documentation"],
            "require_approval": "never",
        },
    ],
)

for output in response.output:
    print(output)
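
The output list mixes MCP bookkeeping items with the final answer. A minimal sketch that separates them, assuming the item type names from the OpenAI Responses reference (mcp_list_tools, mcp_call, message):

for item in response.output:
    if item.type == "mcp_list_tools":
        # Tools the server advertised for this run.
        print("Available tools:", [tool.name for tool in item.tools])
    elif item.type == "mcp_call":
        # One entry per server-side tool invocation.
        print(f"Called {item.name} with {item.arguments}")
    elif item.type == "message":
        # The assistant's final answer.
        print(response.output_text)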

Reasoning effort controls

Some open-source reasoning models expose effort tiers. Pass reasoning={"effort": "low" | "medium" | "high"} to trade off latency and depth.

python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:groq",
    instructions="You are a helpful assistant.",
    input="Say hello to the world.",
    reasoning={"effort": "low"},
)

for i, item in enumerate(response.output):
    print(f"Output #{i}: {item.type}", item.content)

API reference

Read the official OpenAI Responses reference.
