Qwen3-Coder-30B-A3B-Instruct-FP8
This repository contains Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 model card.
Overview
Qwen3-Coder-30B-A3B-Instruct is the streamlined coding variant of the Qwen3-Coder series, an auto-regressive Mixture-of-Experts (MoE) transformer with 30.5B total parameters of which about 3.3B are activated per token. It is optimized for agentic coding and code-related tasks — code generation, repository-scale understanding, and tool use — and operates in non-thinking mode only. Its intended use is the same as the upstream Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8, and it is released under the Apache 2.0 License.
- Architecture: Qwen3-MoE (Mixture-of-Experts)
- Input / Output: Text / Text
- Supported Inference Engine: Furiosa-LLM
- Supported Hardware: FuriosaAI RNGD
Quantization
Weights are quantized to FP8 (static), following the upstream FP8 release, and activations use dynamic FP8 quantization at runtime (per-token / per-block). The KV cache stays in 16-bit precision.
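As a rough illustration of what per-token dynamic scaling means, the sketch below computes one scale per token from the largest absolute activation in that token's vector. It assumes the FP8 E4M3 format, whose largest finite value is 448; this card does not state which FP8 variant is used. The helper name `per_token_scale` is hypothetical, and the final rounding to FP8 is omitted. This is not Furiosa-LLM's implementation.

```python
# Illustrative sketch only: per-token dynamic scaling, assuming FP8 E4M3.
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)

def per_token_scale(token_activations):
    """Return (scale, scaled_values) for one token's activation vector."""
    amax = max(abs(v) for v in token_activations)
    # Guard against an all-zero vector to avoid dividing by zero.
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    scaled = [v / scale for v in token_activations]
    return scale, scaled

# Each token gets its own scale, computed at runtime from its own values.
activations = [[0.5, -2.0, 1.0], [100.0, -50.0, 25.0]]
for row in activations:
    scale, scaled = per_token_scale(row)
    # After scaling, the largest magnitude sits near the FP8 maximum.
    print(scale, max(abs(v) for v in scaled))
```

Because the scale is derived from the values seen at runtime, no calibration pass is needed for activations, in contrast to the static weight scales.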
Features
- Tool calling. The model supports tool (function) calling through the
hermes tool-call parser, the parser used by the Qwen3 series.
Parallelism Strategy
On RNGD, Qwen3-Coder-30B-A3B-Instruct-FP8 runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
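The card count follows from dividing the tensor-parallel size by the number of PEs per card. The minimal sketch below spells out that arithmetic; the helper name `cards_needed` is hypothetical, and rejecting sizes that are not a multiple of 8 is an assumption made for this sketch only.

```python
# Illustrative arithmetic only; the helper name is hypothetical.
PES_PER_CARD = 8  # PEs per RNGD card, as stated above

def cards_needed(tensor_parallel_size, pes_per_card=PES_PER_CARD):
    """Return how many cards a given tensor-parallel size spans."""
    # Assumption for this sketch: only whole cards are used.
    if tensor_parallel_size % pes_per_card != 0:
        raise ValueError("tensor-parallel size must be a multiple of PEs per card")
    return tensor_parallel_size // pes_per_card

print(cards_needed(32))  # prints 4
```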
Usage
To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.
Launch the server
The simplest way to serve the model is:
# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8
When the server is ready, you will see:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Launch the server with tool calling
To enable tool (function) calling, start the server with the hermes tool-call
parser:
furiosa-llm serve furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Query the server
The server exposes an OpenAI-compatible API. You can send a request with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8",
"messages": [{"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."}]
}' \
| python -m json.tool
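The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_chat_request` and `ask` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the standard OpenAI chat-completions format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default port from the launch output above
MODEL = "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8"

def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a chat-completions POST request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    # Standard OpenAI chat-completions response layout (assumed here).
    return reply["choices"][0]["message"]["content"]
```

With the server running, `ask("Write a Python function that returns the nth Fibonacci number.")` should return the model's reply as a string.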
Tool calling
With the server launched using --enable-auto-tool-choice --tool-call-parser hermes,
you can pass tools and let the model decide when to call them. See the
Tool Calling guide
for a complete client example and details on tool-choice options.
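As a minimal sketch of what such a request might look like, the code below builds a request body that offers one tool and reads tool calls back out of a response. It assumes the server accepts the standard OpenAI `tools` and `tool_choice` fields and returns `tool_calls` in the usual OpenAI layout. The `get_weather` tool and both helper names are hypothetical.

```python
import json

MODEL = "furiosa-ai/Qwen3-Coder-30B-A3B-Instruct-FP8"

# Hypothetical tool, defined in the OpenAI function-calling schema.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_request_body(prompt, tools, model=MODEL):
    """Return the JSON body for a chat request that offers tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }

def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]

body = build_tool_request_body("What is the weather in Seoul?", [GET_WEATHER_TOOL])
print(json.dumps(body, indent=2))
```

The printed body can be posted to `/v1/chat/completions` in place of the JSON in the curl request above. When the model answers in plain text instead of calling a tool, `extract_tool_calls` returns an empty list.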
Learn more
- Tool Calling: parsers, tool-choice options, and more examples
- Furiosa-LLM Server (furiosa-llm serve): full OpenAI-compatible API reference and serving options
- Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8: upstream model card