How to use LLMWildling/gpt-oss-160b-kiwi with libraries and local apps.

Libraries

How to use LLMWildling/gpt-oss-160b-kiwi with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LLMWildling/gpt-oss-160b-kiwi")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the generated conversation (the bare call discards the result in a script)
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LLMWildling/gpt-oss-160b-kiwi")
model = AutoModelForCausalLM.from_pretrained("LLMWildling/gpt-oss-160b-kiwi")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use LLMWildling/gpt-oss-160b-kiwi with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LLMWildling/gpt-oss-160b-kiwi"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-160b-kiwi",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/LLMWildling/gpt-oss-160b-kiwi

SGLang

How to use LLMWildling/gpt-oss-160b-kiwi with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LLMWildling/gpt-oss-160b-kiwi" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-160b-kiwi",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LLMWildling/gpt-oss-160b-kiwi" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-160b-kiwi",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use LLMWildling/gpt-oss-160b-kiwi with Docker Model Runner:

```
docker model run hf.co/LLMWildling/gpt-oss-160b-kiwi
```

gpt-oss-160b-kiwi

gpt-oss-160b-kiwi is an agentic-coder variant of GPT-OSS 120B.

After a bunch of iterations in a recursive coding-agent harness, this is the end result of the current 160B branch. It expands the 120B base with 48 added specialist experts and is intended to run at 12 active experts per token.

This model was trained on a 2-GPU setup.

This is one of the best checkpoints from this project so far. The next 180B line should be better, but for now Kiwi is the project's strong agentic-coder release.

Overview

Base model: openai/gpt-oss-120b
Total expert rows: 176
Added specialist experts: 48
Format: MXFP4
Recommended active experts: top-k=12
Intended use: coding, agentic coding, SWE-style workflows, tool-using automation
Status: research preview
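
The expert counts above are consistent with each other, and a few lines of arithmetic make that explicit. This is only a sketch using the numbers listed in the overview; the variable names are illustrative and are not fields from the model config.

```python
# Numbers taken from the overview above; the names are illustrative only.
total_expert_rows = 176
added_specialist_experts = 48
recommended_top_k = 12

# Expert rows inherited from the base model.
base_expert_rows = total_expert_rows - added_specialist_experts

# Share of expert rows selected for each token at the recommended top-k.
active_fraction = recommended_top_k / total_expert_rows

print(f"base expert rows: {base_expert_rows}")
print(
    f"active experts per token: {recommended_top_k} of "
    f"{total_expert_rows} ({active_fraction:.1%})"
)
```

At the recommended setting, roughly 7% of the expert rows are active for any given token.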

Recommended vLLM Settings

This model was tested with vLLM using the GPT-OSS reasoning and OpenAI tool-call parsers.

vllm serve /path/to/model \
  --served-model-name vllm/doobee \
  --tensor-parallel-size 2 \
  --max-model-len 60000 \
  --gpu-memory-utilization 0.88 \
  --enforce-eager \
  --trust-remote-code \
  --reasoning-parser openai_gptoss \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --hf-overrides '{"num_experts_per_tok": 12}'

Recommended parameters:

num_experts_per_tok=12
tensor-parallel-size=2
max-model-len=60000
gpu-memory-utilization=0.88
reasoning-parser=openai_gptoss
tool-call-parser=openai
enable-auto-tool-choice

The staged config already sets num_experts_per_tok=12 and experts_per_token=12. If your runtime ignores those fields, pass the --hf-overrides value from the command above explicitly.
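
Before serving, it can help to confirm that a local copy of the config really carries those two fields. The following is a minimal sketch, assuming both fields sit at the top level of config.json. The helper names (check_top_k, hf_overrides_arg, check_config_file) are hypothetical and not part of vLLM or Transformers.

```python
import json
from pathlib import Path

EXPECTED_TOP_K = 12
# Field names mentioned in the text above; assumed to be top-level config keys.
TOP_K_FIELDS = ("num_experts_per_tok", "experts_per_token")


def check_top_k(config: dict, expected: int = EXPECTED_TOP_K) -> list[str]:
    """Return the top-k fields that are missing or differ from `expected`."""
    return [name for name in TOP_K_FIELDS if config.get(name) != expected]


def hf_overrides_arg(expected: int = EXPECTED_TOP_K) -> str:
    """Build the JSON string to pass to --hf-overrides."""
    return json.dumps({"num_experts_per_tok": expected})


def check_config_file(model_dir: str) -> list[str]:
    """Load <model_dir>/config.json and report fields that need overriding."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    return check_top_k(config)


if __name__ == "__main__":
    # Inline sample standing in for a real config.json.
    sample = {"num_experts_per_tok": 12, "experts_per_token": 4}
    print("fields needing attention:", check_top_k(sample))
    print("--hf-overrides", repr(hf_overrides_arg()))
```

If check_config_file returns an empty list for your model directory, the config already matches the recommended top-k. Otherwise the override string is the safer route.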

Tool Calling

Kiwi was primarily built and tested as an agentic coding model.

Recommended temperatures:

0.0 for deterministic tool use
0.3 for steady coding-agent work
0.6 for more flexible agentic exploration

The recommended serving path is OpenAI-compatible Chat Completions with vLLM's GPT-OSS reasoning parser and OpenAI tool-call parser enabled.
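
A tool-calling request against that serving path could look like the sketch below. It assumes vLLM's default port 8000 and the --served-model-name value from the command above. The read_file tool and the preset names are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Assumptions: vLLM's default port and the --served-model-name used above.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "vllm/doobee"

# Temperatures recommended above, keyed by hypothetical preset names.
TEMPERATURES = {"deterministic": 0.0, "steady": 0.3, "exploratory": 0.6}

# Hypothetical tool in the OpenAI function-calling schema; not shipped with the model.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(prompt: str, preset: str = "deterministic") -> dict:
    """Build a Chat Completions payload that lets the model choose whether to call a tool."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
        "temperature": TEMPERATURES[preset],
    }


def send(payload: dict) -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_request("Open setup.py and summarize it.")
    print(json.dumps(payload, indent=2))
    # With a server running, any tool calls appear under
    # send(payload)["choices"][0]["message"]["tool_calls"].
```

Only build_request runs here, so the example works without a server. Call send(payload) once vLLM is up.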

Kiwi is the current strong 160B agentic-coder checkpoint. The upcoming 180B line is expected to push this further.

License

Replace the placeholder "license: other" metadata with the actual license you want to publish under, after confirming that it is compatible with the base model's license and with your added weights.

Safetensors

Model size: 165B params
Tensor types: BF16, F32

Model tree for LLMWildling/gpt-oss-160b-kiwi

Base model: openai/gpt-oss-120b
This model: quantized