Qwen3-Coder-30B-A3B-Instruct-FP8

This repository contains Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 model card.

Overview

Qwen3-Coder-30B-A3B-Instruct is the streamlined coding variant of the Qwen3-Coder series: an auto-regressive Mixture-of-Experts (MoE) transformer with 30.5B total parameters, of which about 3.3B are activated per token. It is optimized for agentic coding and code-related tasks (code generation, repository-scale understanding, and tool use) and operates in non-thinking mode only. Its intended use is the same as that of the upstream Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, and it is released under the Apache 2.0 License.
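
As a quick sanity check on the MoE figures above, the following sketch computes the share of parameters that is active per token. The two constants are the rounded values quoted in this card, so the result is approximate.

```python
# Rounded figures quoted in this model card (in billions of parameters).
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3

# Fraction of the model's parameters that is activated for each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"{active_fraction:.1%} of parameters active per token")
```

In other words, roughly one ninth of the weights take part in each forward step, which is where the "A3B" in the model name comes from.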

Architecture: Qwen3-MoE (Mixture-of-Experts)
Input / Output: Text / Text
Supported Inference Engine: Furiosa-LLM
Supported Hardware: FuriosaAI RNGD

Quantization

Weights are quantized to FP8 (static), following the upstream FP8 release, and activations use dynamic FP8 quantization at runtime (per-token / per-block). The KV cache stays in 16-bit precision.
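
To make "dynamic per-token quantization" concrete, here is a minimal pure-Python sketch of the scaling step only. It is not Furiosa-LLM's implementation: the FP8 variant (E4M3, maximum finite value 448) is an assumption, the function names are hypothetical, and rounding to the FP8 grid is deliberately left out.

```python
# Largest finite value of FP8 E4M3. Assumed here: this card does not state
# which FP8 variant is used.
FP8_E4M3_MAX = 448.0


def per_token_scales(activations):
    """Return one scale per token (row) so that row / scale fits the FP8 range."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def quantize_dequantize(activations):
    """Scale each row into the FP8 range, clip, and scale back.

    Rounding to representable FP8 values is omitted, so this shows only how
    the scale is chosen at runtime, not the precision loss.
    """
    scales = per_token_scales(activations)
    restored = []
    for row, scale in zip(activations, scales):
        scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        restored.append([v * scale for v in scaled])
    return restored, scales
```

"Dynamic" here means the scale is derived from each token's own values at runtime, whereas the static weight scales are fixed ahead of time.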

Features

Tool calling. The model supports tool (function) calling through the hermes tool-call parser, the parser used by the Qwen3 series.

Parallelism Strategy

On RNGD, Qwen3-Coder-30B-A3B-Instruct-FP8 runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
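
The card count follows directly from the PE count. A small sketch of that arithmetic, using the 8-PEs-per-card figure from the sentence above (the helper name `cards_needed` is hypothetical):

```python
PES_PER_CARD = 8  # from this card: 8 PEs per RNGD card


def cards_needed(tensor_parallel_pes, pes_per_card=PES_PER_CARD):
    """Return the number of cards required to supply the given number of PEs."""
    # Ceiling division, so a partially used card still counts as a whole card.
    return -(-tensor_parallel_pes // pes_per_card)


print(cards_needed(32))  # tensor-parallel size used for this model
```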

Usage

To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.

Launch the server

The simplest way to serve the model is:

# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8

When the server is ready, you will see:

INFO:     Started server process [27507]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
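
If you start the server from a script, you may prefer to wait for the port rather than watch the log. The sketch below polls with a plain TCP connection and assumes the default host and port shown above; `wait_for_port` is a hypothetical helper, not part of Furiosa-LLM, and a successful connection only shows that something is listening on that port.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Opening and immediately closing a connection is enough to
            # confirm that a listener is bound to the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Model loading can take a while, so choose a `timeout` that suits your environment.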

Launch the server with tool calling

To enable tool (function) calling, start the server with the hermes tool-call parser:

furiosa-llm serve furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Query the server

The server exposes an OpenAI-compatible API. You can send a request with curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."}]
    }' \
    | python -m json.tool
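
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server is running on localhost:8000 as above and that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`). The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address from the launch step
MODEL = "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Write a Python function that returns the nth Fibonacci number."))
```

Because the API is OpenAI-compatible, an OpenAI client library pointed at the same base URL should work as well; the standard-library version is shown here to avoid extra dependencies.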

Tool calling

With the server launched using --enable-auto-tool-choice --tool-call-parser hermes, you can pass tools and let the model decide when to call them. See the Tool Calling guide for a complete client example and details on tool-choice options.
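
As a rough illustration of what "passing tools" looks like, the sketch below builds a request body in the OpenAI tools format and reads tool calls back out of a response. The `get_weather` tool and both helper names are hypothetical, and the response layout (`tool_calls` with JSON-encoded `arguments`) is assumed from the OpenAI chat-completions convention rather than taken from Furiosa-LLM output.

```python
import json

MODEL = "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8"

# Hypothetical tool definition, for illustration only.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools, model=MODEL):
    """Build a chat-completions body that lets the model choose a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # the model decides whether to call a tool
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string in the OpenAI format.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The body from `build_tool_request` is POSTed to `/v1/chat/completions` like any other chat request. If `extract_tool_calls` returns an empty list, the model answered in plain text instead of calling a tool.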

Learn more

Tool Calling — parsers, tool-choice options, and more examples
Furiosa-LLM Server (furiosa-llm serve) — full OpenAI-compatible API reference and serving options
Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 — upstream model card