Qwen3-0.6B on AXERA NPU
Ready-to-run deployment package for Qwen/Qwen3-0.6B on AX637.
- This release packages the AX637 text-only axllm runtime files.
- The package includes the bundled axllm binary, compiled text .axmodel files, tokenizer data, and runtime configs.
- The validated runtime layout uses prefill_len=128, kv_cache_len=1023, and prefill_max_token_num=768.
- This package supports text-only chat through axllm run and the OpenAI-compatible axllm serve API.
Supported Platform
- AX637
Validated Devices
This package has been validated on the following device class:
- AX637 development board
Performance
All measurements below were taken on AX637 with the packaged binary launched from the package root.
TTFT stands for time to first token.
The table below reports steady-state measurements and excludes the first request after startup.
For this Qwen3 package, the benchmark prompts use the model's public /no_think instruction so that the measured outputs reflect direct answers or direct generation rather than a variable-length reasoning preamble.
Output tokens below are model-token counts observed in the board-side runtime log.
| Scenario | Input tokens | Output tokens | Prefill chunks | TTFT | Decode |
|---|---|---|---|---|---|
| Short-answer text request (/no_think + city-name answer) | 28 | 5 | 1 x 128 | 234.47 ms | 6.50 token/s avg |
| Long-output text generation (/no_think + number sequence request) | 30 | 256 | 1 x 128 | 233.89 ms | 8.13 token/s avg |
For sustained decode throughput, use the long-output row as the more representative reference because the short-answer row terminates after only a few generated tokens.
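As a rough worked example of what the long-output row implies, the sketch below estimates end-to-end latency as TTFT plus output tokens divided by the decode rate. This formula is a simplifying assumption (it ignores network, tokenization, and any per-request overhead), and the variable names are illustrative only.

```shell
# Figures copied from the long-output row of the table above.
ttft_ms=233.89
decode_tps=8.13
output_tokens=256

# Assumed model: total time = TTFT + output_tokens / decode rate.
decode_s=$(LC_ALL=C awk -v n="$output_tokens" -v r="$decode_tps" \
  'BEGIN { printf "%.1f", n / r }')
total_s=$(LC_ALL=C awk -v t="$ttft_ms" -v n="$output_tokens" -v r="$decode_tps" \
  'BEGIN { printf "%.1f", t / 1000 + n / r }')

echo "decode: ${decode_s} s, total: ${total_s} s"
```

Under these assumptions, decode time dominates: roughly half a minute for 256 tokens, with TTFT contributing well under one second.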
The packaged runtime uses the following context layout:
- prefill_len=128
- kv_cache_len=1023
- prefill_max_token_num=768
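To illustrate how these values could relate to the Prefill chunks column, the sketch below assumes the runtime splits a prompt into chunks of prefill_len tokens using ceiling division. That chunking rule is an assumption made for illustration, not a description of axllm internals, and chunks_for is a hypothetical helper.

```shell
prefill_len=128
prefill_max_token_num=768

# Assumed rule: chunks = ceil(input_tokens / prefill_len).
chunks_for() {
  echo $(( ($1 + prefill_len - 1) / prefill_len ))
}

short_chunks=$(chunks_for 28)   # input size of the short-answer row
long_chunks=$(chunks_for 30)    # input size of the long-output row
max_chunks=$(( prefill_max_token_num / prefill_len ))

echo "28 tokens -> ${short_chunks} chunk(s)"
echo "30 tokens -> ${long_chunks} chunk(s)"
echo "largest prefill -> ${max_chunks} chunks of ${prefill_len}"
```

Both benchmark prompts fit in a single 128-token chunk under this rule, which agrees with the 1 x 128 entries reported in the table.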
Startup Runtime Footprint
| Item | Value |
|---|---|
| Package flash total | 1.30 GiB |
| Runtime CMM increment during board-side startup | 774 MB |
The CMM value above was measured from the validated AX637 board-side startup log, where remain_cmm decreased from 2005 MB after the first text layer init to 1231 MB after post-model init.
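The arithmetic behind the reported increment can be reproduced from the two remain_cmm readings quoted above. The variable names are illustrative and are not fields of the runtime log.

```shell
cmm_before_mb=2005   # remain_cmm after the first text layer init
cmm_after_mb=1231    # remain_cmm after post-model init

cmm_used_mb=$(( cmm_before_mb - cmm_after_mb ))
echo "Runtime CMM increment: ${cmm_used_mb} MB"
```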
Package Layout
.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── qwen3_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
├── ...
├── qwen3_p128_l27_together.axmodel
└── qwen3_post.axmodel
This package uses a flat runtime layout. The bundled axllm binary lives under bin/, and the compiled text runtime files live at the repository root.
Direct Inference with axllm
Download the Model Package
Download the release package from Hugging Face:
mkdir -p AXERA-TECH/Qwen3-0.6B-AX637
cd AXERA-TECH/Qwen3-0.6B-AX637
hf download AXERA-TECH/Qwen3-0.6B-AX637 --local-dir .
Install axllm
Option 1: use the bundled binary in this repository.
chmod +x ./bin/axllm
If your shell does not already expose the AXERA runtime libraries and the bundled binary reports missing shared libraries such as libax_engine.so, add /opt/lib to LD_LIBRARY_PATH before launch.
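One way to do this in the current shell session is sketched below. It prepends /opt/lib and keeps any entries that are already set; whether /opt/lib is the right location on your board image is taken from the note above and may differ on your setup.

```shell
# Prepend /opt/lib; the ${VAR:+...} form avoids a trailing ':' when
# LD_LIBRARY_PATH was previously empty or unset.
export LD_LIBRARY_PATH="/opt/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

The setting lasts only for the current session, so repeat it (or add it to your shell profile) before launching the binary from a new shell.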
Option 2: install axllm from the public repository if you prefer a system-wide binary:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Run on the Board
From the package root on the board:
chmod +x ./bin/axllm
./bin/axllm serve . --port 8000
Expected model id:
AXERA-TECH/Qwen3-0.6B-AX637
Health check:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Example output:
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
{
"data": [
{
"id": "AXERA-TECH/Qwen3-0.6B-AX637",
"object": "model"
}
],
"object": "list"
}
If you prefer the interactive CLI:
chmod +x ./bin/axllm
./bin/axllm run .
Text Request
The following request was validated on AX637 and returned the final answer successfully:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/Qwen3-0.6B-AX637",
"messages": [
{"role": "user", "content": "/no_think\nWhat is the capital of the United States? Answer with the city name only."}
],
"max_tokens": 32,
"temperature": 0
}'
Example output:
{
"choices": [
{
"message": {
"content": "Washington",
"role": "assistant"
}
}
]
}
Practical note:
- By default, some prompts may emit a reasoning-style preamble before the final answer.
- If you want direct-answer behavior for short factual prompts, prefix the last user message with /no_think, as shown in the validated example above.
- If you do not use /no_think, keep enough max_tokens budget for both the reasoning span and the final answer.
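For the last point, a request without /no_think might look like the sketch below. It reuses the endpoint and model id from the validated example; the max_tokens value of 512 is an illustrative budget rather than a validated setting, and ask is a hypothetical helper. This variant was not part of the validated runs described above.

```shell
# Assumed values: the port matches the serve command shown earlier;
# 512 is an illustrative token budget, not a validated setting.
base_url="http://127.0.0.1:8000"
payload='{
  "model": "AXERA-TECH/Qwen3-0.6B-AX637",
  "messages": [
    {"role": "user", "content": "What is the capital of the United States?"}
  ],
  "max_tokens": 512,
  "temperature": 0
}'

# Hypothetical helper wrapping the same endpoint as the validated request.
ask() {
  curl -s "${base_url}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

# On the board, with axllm serve running:
#   ask "$payload"
```

If the reply is cut off before the final answer appears, raise max_tokens or switch back to the /no_think prefix.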
Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: Qwen/Qwen3-0.6B
- AXERA runtime repository: AXERA-TECH/ax-llm
Discussion
- GitHub Issues: AXERA-TECH/ax-llm/issues
- QQ group: 139953715