Qwen3-0.6B on AXERA NPU

Ready-to-run deployment package for Qwen/Qwen3-0.6B on AX637.

This release packages the text-only axllm runtime for AX637: the bundled axllm binary, compiled text .axmodel files, tokenizer data, and runtime configs.
The validated runtime layout uses prefill_len=128, kv_cache_len=1023, and prefill_max_token_num=768.
This package supports text-only chat through axllm run and the OpenAI-compatible axllm serve API.

Supported Platform

AX637

Validated Devices

This package has been validated on the following device class:

AX637 development board

Performance

All measurements below were taken on AX637 with the packaged binary launched from the package root.

TTFT stands for time to first token.

The table below reports steady-state measurements and excludes the first request after startup.

For this Qwen3 package, the benchmark prompts use the model's public /no_think instruction so that the measured outputs reflect direct answers or direct generation rather than a variable-length reasoning preamble.

Output tokens below are model-token counts observed in the board-side runtime log.

Scenario	Input tokens	Output tokens	Prefill chunks	TTFT	Decode
Short-answer text request (`/no_think` + city-name answer)	`28`	`5`	`1 x 128`	`234.47 ms`	`6.50 token/s avg`
Long-output text generation (`/no_think` + number sequence request)	`30`	`256`	`1 x 128`	`233.89 ms`	`8.13 token/s avg`

For sustained decode throughput, use the long-output row as the more representative reference because the short-answer row terminates after only a few generated tokens.
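As a rough planning aid, the table values can be combined into an end-to-end latency estimate of TTFT plus output tokens divided by the decode rate. The sketch below assumes the decode rate stays constant over the whole generation, which the measurements do not guarantee; `estimate_latency_s` is a hypothetical helper name and is not part of axllm.

```shell
# Rough latency estimate in seconds: TTFT + output_tokens / decode_rate.
# estimate_latency_s is a hypothetical helper, not an axllm command.
estimate_latency_s() {
  # $1 = TTFT in ms, $2 = output tokens, $3 = decode rate in token/s
  awk -v ttft_ms="$1" -v n="$2" -v rate="$3" \
    'BEGIN { printf "%.1f\n", ttft_ms / 1000 + n / rate }'
}

# Long-output row from the table above: 233.89 ms TTFT, 256 tokens, 8.13 token/s.
estimate_latency_s 233.89 256 8.13   # prints 31.7
```

Treat the result as an order-of-magnitude figure for sizing client timeouts, not as a measured value.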

The packaged runtime uses the following context layout:

prefill_len=128
kv_cache_len=1023
prefill_max_token_num=768
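
To relate these values to the "Prefill chunks" column, the sketch below computes how many 128-token chunks a prompt would need. It assumes the runtime splits the prompt into chunks of prefill_len tokens (ceiling division) and that prefill_max_token_num is the upper bound on prompt length; both are inferred from the parameter names and the benchmark rows, not confirmed behavior. The function names are hypothetical.

```shell
# Values copied from the packaged context layout.
PREFILL_LEN=128
PREFILL_MAX_TOKEN_NUM=768

# Assumption: a prompt of $1 tokens is processed in ceil($1 / PREFILL_LEN) chunks.
prefill_chunks() {
  echo $(( ($1 + PREFILL_LEN - 1) / PREFILL_LEN ))
}

# Assumption: prompts longer than PREFILL_MAX_TOKEN_NUM do not fit the prefill stage.
fits_prefill() {
  [ "$1" -le "$PREFILL_MAX_TOKEN_NUM" ]
}

prefill_chunks 28                                 # prints 1, matching the "1 x 128" rows
prefill_chunks 300                                # prints 3
echo $(( PREFILL_MAX_TOKEN_NUM / PREFILL_LEN ))   # prints 6, the most chunks under this assumption
```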

Startup Runtime Footprint

Item	Value
`Package flash total`	`1.30 GiB`
`Runtime CMM increment during board-side startup`	`774 MB`

The CMM value above was measured from the validated AX637 board-side startup log, where remain_cmm decreased from 2005 MB after the first text layer init to 1231 MB after post-model init.

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── qwen3_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
├── ...
├── qwen3_p128_l27_together.axmodel
└── qwen3_post.axmodel

This package uses a flat runtime layout. The bundled axllm binary lives under bin/, and the compiled text runtime files live at the repository root.

Direct Inference with `axllm`

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/Qwen3-0.6B-AX637
cd AXERA-TECH/Qwen3-0.6B-AX637
hf download AXERA-TECH/Qwen3-0.6B-AX637 --local-dir .

Install `axllm`

Option 1: use the bundled binary in this repository.

chmod +x ./bin/axllm

If the bundled binary reports missing shared libraries such as libax_engine.so, your shell is not exposing the AXERA runtime libraries. Add /opt/lib to LD_LIBRARY_PATH before launching.
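
One way to add /opt/lib for the current shell session is sketched below. It prepends the directory only if it is not already listed, so it is safe to run more than once. The commented `ldd` line is an optional check and assumes `ldd` is available on the board image.

```shell
# Prepend /opt/lib to LD_LIBRARY_PATH unless it is already listed.
case ":${LD_LIBRARY_PATH:-}:" in
  *:/opt/lib:*) ;;   # already present, nothing to do
  *) export LD_LIBRARY_PATH="/opt/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac

# Optional check, assuming ldd exists on the board:
# ldd ./bin/axllm | grep "not found"
```

The setting lasts for the current session only; add the same lines to your shell profile if you want it to persist.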

Option 2: install axllm from the public repository if you prefer a system-wide binary:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Run on the Board

From the package root on the board:

chmod +x ./bin/axllm
./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/Qwen3-0.6B-AX637

Health check:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example output:

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}

{
  "data": [
    {
      "id": "AXERA-TECH/Qwen3-0.6B-AX637",
      "object": "model"
    }
  ],
  "object": "list"
}
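
When starting the server from a script, it can help to wait until /health reports "healthy" before sending requests. The sketch below polls the endpoint with curl; `is_healthy` and `wait_healthy` are hypothetical helper names, not axllm commands, and the check assumes the response keeps the `"status": "healthy"` field shown in the example output above.

```shell
# Succeeds if the /health JSON body on stdin reports status "healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Polls the health endpoint once per second.
# $1 = URL (default http://127.0.0.1:8000/health), $2 = max attempts (default 30)
wait_healthy() {
  url="${1:-http://127.0.0.1:8000/health}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if curl -fsS --max-time 2 "$url" 2>/dev/null | is_healthy; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage: wait_healthy && echo "server ready"
```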

If you prefer the interactive CLI:

chmod +x ./bin/axllm
./bin/axllm run .

Text Request

The following request was validated on AX637 and returned the final answer successfully:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/Qwen3-0.6B-AX637",
    "messages": [
      {"role": "user", "content": "/no_think\nWhat is the capital of the United States? Answer with the city name only."}
    ],
    "max_tokens": 32,
    "temperature": 0
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "content": "Washington",
        "role": "assistant"
      }
    }
  ]
}

Practical note:

By default, the model may emit a reasoning-style preamble before the final answer for some prompts.
If you want direct-answer behavior for short factual prompts, prefix the last user message with /no_think, as shown in the validated example above.
If you do not use /no_think, keep enough max_tokens budget for both the reasoning span and the final answer.
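
The note above can be wrapped in a small helper that always prefixes the user message with /no_think and lets you set max_tokens per request. This is a sketch only: `build_chat_payload` is a hypothetical name, and it escapes just backslashes and double quotes, so it assumes single-line prompts.

```shell
# Builds a chat completions payload with a /no_think prefix.
# build_chat_payload is a hypothetical helper, not part of axllm.
# $1 = single-line user prompt, $2 = max_tokens (default 32)
build_chat_payload() {
  # Escape backslashes and double quotes so the prompt is a valid JSON string.
  prompt=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "AXERA-TECH/Qwen3-0.6B-AX637", "messages": [{"role": "user", "content": "/no_think\\n%s"}], "max_tokens": %s, "temperature": 0}\n' \
    "$prompt" "${2:-32}"
}

# Usage against a running server:
# build_chat_payload "What is the capital of the United States? Answer with the city name only." \
#   | curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d @-
```

For prompts sent without /no_think, drop the prefix from the format string and pass a larger second argument to leave room for the reasoning span.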

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Original Hugging Face model: Qwen/Qwen3-0.6B
AXERA runtime repository: AXERA-TECH/ax-llm

Discussion

GitHub Issues: AXERA-TECH/ax-llm/issues
QQ group: 139953715
