Responses API (beta)
The Responses API (from OpenAI) provides a unified interface for model interactions with Hugging Face Inference Providers. Use your existing OpenAI SDKs to access features like multi-provider routing, event streaming, structured outputs, and Remote MCP tools.
This guide assumes you have a Hugging Face account and access token. You can create a free account at huggingface.co and get your token from your settings page.
Why build with the Responses API?
The Responses API provides a unified interface built for agentic apps. With it, you get:
- Built-in tool orchestration. Invoke functions and server-side MCP tools, and get schema-validated outputs, without changing endpoints.
- Event-driven streaming. Receive semantic events such as `response.created`, `output_text.delta`, and `response.completed` to power incremental UIs.
- Reasoning controls and structured outputs. Dial reasoning effort up or down and require models to return schema-compliant JSON every time.
Prerequisites
- A Hugging Face account with remaining Inference Providers credits (free tier available).
- A fine-grained Hugging Face token with the “Make calls to Inference Providers” permission, stored in `HF_TOKEN`.
All Inference Providers chat completion models should be compatible with the Responses API. You can browse available models on the Inference Models page.
Configure your Responses client
Install the OpenAI SDK for your language of choice before running the snippets below (`pip install openai` for Python or `npm install openai` for Node.js). If you prefer issuing raw HTTP calls, any standard tool such as `curl` will work as well.
import os

from openai import OpenAI

# Point the OpenAI client at the Hugging Face router and authenticate with your token
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:groq",
    instructions="You are a helpful assistant.",
    input="Tell me a three-sentence bedtime story about a unicorn.",
)

print(response.output_text)
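If you would rather issue the request over raw HTTP, the same call looks like this with `curl`. A minimal sketch, assuming the router exposes the standard OpenAI `POST /v1/responses` endpoint that the SDK targets:

curl https://router.huggingface.co/v1/responses \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b:groq",
    "instructions": "You are a helpful assistant.",
    "input": "Tell me a three-sentence bedtime story about a unicorn."
  }'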
If you plan to use a specific provider, append it to the model id as `<repo>:<provider>` (for example, `moonshotai/Kimi-K2-Instruct-0905:groq`). Otherwise, omit the suffix and let routing fall back to the default provider.
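Both forms are valid values for `model`; the second leaves provider selection to the router:

# Pin a specific provider with the `:<provider>` suffix
model = "moonshotai/Kimi-K2-Instruct-0905:groq"

# Omit the suffix to use the default provider for the model
model = "moonshotai/Kimi-K2-Instruct-0905"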
Core Response patterns
Plain text output
For a single response message, pass a string as input. The Responses API returns both the full `response` object and a convenience `output_text` helper.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    instructions="You are a helpful assistant.",
    input="Tell me a three-sentence bedtime story about a unicorn.",
)

print(response.output_text)
Multimodal inputs
Mix text and vision content by passing a list of content parts. The Responses API unifies text and images into a single `input` array.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "what is in this image?"},
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)
Multi-turn conversations
Responses requests accept conversation history. Add `developer`, `system`, and `user` messages to control the assistant’s behavior without managing chat state yourself.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input=[
        {"role": "developer", "content": "Talk like a pirate."},
        {"role": "user", "content": "Are semicolons optional in JavaScript?"},
    ],
)

print(response.output_text)
Advanced features
The advanced features below all use the same request format as the basic examples above.
Event-based streaming
Set `stream=True` to receive incremental `response.*` events. Each event arrives as JSON, so you can render words as they stream in or monitor tool execution in real time.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

stream = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input=[{"role": "user", "content": "Say 'double bubble bath' ten times fast."}],
    stream=True,
)

for event in stream:
    print(event)
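To drive an incremental UI, filter for the text delta events instead of printing everything. A minimal sketch, assuming the SDK surfaces each event with a `type` field and text fragments under `event.delta`, as the OpenAI Python SDK does:

for event in stream:
    # Print each text fragment as soon as it arrives
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    # The final event carries the completed response object
    elif event.type == "response.completed":
        print()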
Tool calling and routing
Add a `tools` array to let the model call your functions. The router handles the function calls and returns tool events.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

tools = [
    {
        "type": "function",
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    }
]

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    tools=tools,
    input="What is the weather like in Boston today?",
    tool_choice="auto",
)

print(response)
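The output contains a function-call item rather than a final answer; running the function and sending its result back is up to you. A minimal sketch of that round trip, assuming a hypothetical local `get_current_weather` implementation and the standard Responses function-call item fields (`type`, `name`, `arguments`, `call_id`):

import json

# Hypothetical stand-in for a real weather lookup
def get_current_weather(location: str, unit: str) -> str:
    return f"It is 22 degrees {unit} in {location}."

# Rebuild the conversation: original question, the model's function call, then its result
input_list = [{"role": "user", "content": "What is the weather like in Boston today?"}]
input_list += response.output

for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        input_list.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": get_current_weather(**args),
        })

# Second request: the model now answers using the tool result
follow_up = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    tools=tools,
    input=input_list,
)
print(follow_up.output_text)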
Structured outputs
Force the model to return JSON matching a schema by supplying your target type. The Python SDK exposes a `.parse` helper that takes the type as `text_format` and converts the response directly into it.
When calling `openai/gpt-oss-120b:groq` from JavaScript or raw HTTP, include a brief instruction to return JSON. Without it the model may emit markdown even when a schema is provided.
import os

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="openai/gpt-oss-120b:groq",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    text_format=CalendarEvent,
)

print(response.output_parsed)
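Outside the `.parse` helper you can pass the schema explicitly. A hedged sketch using `responses.create`, assuming the router accepts the standard OpenAI `text.format` JSON-schema shape; note the brief “Respond in JSON” instruction recommended above:

response = client.responses.create(
    model="openai/gpt-oss-120b:groq",
    input=[
        {"role": "system", "content": "Extract the event information. Respond in JSON."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    text={
        "format": {
            "type": "json_schema",
            "name": "calendar_event",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "date": {"type": "string"},
                    "participants": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["name", "date", "participants"],
                "additionalProperties": False,
            },
        }
    },
)

print(response.output_text)  # a JSON string matching the schema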
Remote MCP execution
Remote MCP lets you call server-hosted tools that implement the Model Context Protocol. Provide the MCP server URL and allowed tools, and the Responses API handles the calls for you.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="moonshotai/Kimi-K2-Instruct-0905:groq",
    input="how does tiktoken work?",
    tools=[
        {
            "type": "mcp",
            "server_label": "gitmcp",
            "server_url": "https://gitmcp.io/openai/tiktoken",
            "allowed_tools": ["search_tiktoken_documentation", "fetch_tiktoken_documentation"],
            "require_approval": "never",
        },
    ],
)

for output in response.output:
    print(output)
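The output list interleaves MCP activity with the final answer. A short sketch for separating the two, assuming the standard `mcp_call` and `message` item types:

for item in response.output:
    if item.type == "mcp_call":
        # One entry per server-side tool invocation
        print(f"Called {item.name} on {item.server_label}")
    elif item.type == "message":
        # The assistant's final answer
        print(response.output_text)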
Reasoning effort controls
Some open-source reasoning models expose effort tiers. Pass `reasoning={"effort": "low" | "medium" | "high"}` to trade off latency and depth.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-120b:groq",
    instructions="You are a helpful assistant.",
    input="Say hello to the world.",
    reasoning={"effort": "low"},
)

for i, item in enumerate(response.output):
    print(f"Output #{i}: {item.type}", item.content)
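With reasoning models the output list typically holds a reasoning item followed by the assistant message. A hedged sketch for extracting only the final text, assuming the standard `message` item and `output_text` content-part types (in practice `response.output_text` gives you the same concatenation):

final_text = "".join(
    part.text
    for item in response.output
    if item.type == "message"
    for part in item.content
    if part.type == "output_text"
)
print(final_text)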
API reference
Read the official OpenAI Responses reference.