Qwen3-0.6B on AXERA NPU

Ready-to-run deployment package for Qwen/Qwen3-0.6B on AX637.

  • This release packages the AX637 text-only axllm runtime files.
  • The package includes the bundled axllm binary, compiled text .axmodel files, tokenizer data, and runtime configs.
  • The validated runtime layout uses prefill_len=128, kv_cache_len=1023, and prefill_max_token_num=768.
  • This package supports text-only chat through axllm run and the OpenAI-compatible axllm serve API.

Supported Platform

  • AX637

Validated Devices

This package has been validated on the following device class:

  • AX637 development board

Performance

All measurements below were taken on AX637 with the packaged binary launched from the package root.

TTFT stands for time to first token.

The table below reports steady-state measurements and excludes the first request after startup.

For this Qwen3 package, the benchmark prompts use the model's public /no_think instruction so that the measured outputs reflect direct answers or direct generation rather than a variable-length reasoning preamble.

Output tokens below are model-token counts observed in the board-side runtime log.

| Scenario | Input tokens | Output tokens | Prefill chunks | TTFT | Decode |
| --- | --- | --- | --- | --- | --- |
| Short-answer text request (/no_think + city-name answer) | 28 | 5 | 1 x 128 | 234.47 ms | 6.50 token/s avg |
| Long-output text generation (/no_think + number sequence request) | 30 | 256 | 1 x 128 | 233.89 ms | 8.13 token/s avg |

For sustained decode throughput, use the long-output row as the more representative reference because the short-answer row terminates after only a few generated tokens.
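As a rough planning aid, the table figures can be combined into an end-to-end latency estimate: TTFT plus the number of output tokens divided by the average decode rate. The sketch below is illustrative and not part of the package. The helper name `estimate_latency` is hypothetical, and the formula assumes the decode rate stays constant over the whole generation, which the measurements above do not guarantee.

```shell
# estimate_latency TTFT_MS OUTPUT_TOKENS DECODE_TOKENS_PER_S
# Prints an approximate end-to-end latency in seconds.
# Assumption: latency ~= TTFT + output_tokens / decode_rate.
estimate_latency() {
  awk -v ttft_ms="$1" -v n="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", ttft_ms / 1000 + n / rate }'
}

# Long-output row from the table: 233.89 ms TTFT, 256 tokens, 8.13 token/s.
estimate_latency 233.89 256 8.13
```

With the long-output row this comes to roughly half a minute for 256 tokens, so the decode phase dominates and TTFT is a minor contribution.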

The packaged runtime uses the following context layout:

  • prefill_len=128
  • kv_cache_len=1023
  • prefill_max_token_num=768
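The "1 x 128" entries in the performance table suggest that prompts are processed in fixed chunks of prefill_len tokens. Under that assumption, and assuming prefill_max_token_num is the upper bound on prompt length, the chunk count for a prompt can be sketched as a ceiling division. The helper name `prefill_chunks` is hypothetical, and the actual runtime behavior may differ.

```shell
PREFILL_LEN=128
PREFILL_MAX_TOKEN_NUM=768

# prefill_chunks INPUT_TOKENS
# Prints the number of prefill_len-sized chunks needed for the prompt.
# Fails if the prompt exceeds prefill_max_token_num (assumed to be the limit).
prefill_chunks() {
  if [ "$1" -gt "$PREFILL_MAX_TOKEN_NUM" ]; then
    echo "prompt of $1 tokens exceeds prefill_max_token_num=$PREFILL_MAX_TOKEN_NUM" >&2
    return 1
  fi
  # Ceiling division using integer arithmetic.
  echo $(( ($1 + PREFILL_LEN - 1) / PREFILL_LEN ))
}

prefill_chunks 28    # short-answer row from the table
prefill_chunks 768   # largest prompt under this assumption
```

Both benchmark prompts (28 and 30 tokens) fit in a single chunk, which matches the table.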

Startup Runtime Footprint

| Item | Value |
| --- | --- |
| Package flash total | 1.30 GiB |
| Runtime CMM increment during board-side startup | 774 MB |

The CMM value above was measured from the validated AX637 board-side startup log, where remain_cmm decreased from 2005 MB after the first text layer init to 1231 MB after post-model init.

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── config.json
├── post_config.json
├── qwen3_tokenizer.txt
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
├── ...
├── qwen3_p128_l27_together.axmodel
└── qwen3_post.axmodel

This package uses a flat runtime layout. The bundled axllm binary lives under bin/, and the compiled text runtime files live at the repository root.

Direct Inference with axllm

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/Qwen3-0.6B-AX637
cd AXERA-TECH/Qwen3-0.6B-AX637
hf download AXERA-TECH/Qwen3-0.6B-AX637 --local-dir .

Install axllm

Option 1: use the bundled binary in this repository.

chmod +x ./bin/axllm

If the bundled binary reports missing shared libraries such as libax_engine.so, your shell is not exposing the AXERA runtime libraries. In that case, add /opt/lib to LD_LIBRARY_PATH before launch.
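One way to set the library path for the current shell session is shown below. This is a minimal sketch that assumes the AXERA runtime libraries are installed under /opt/lib as stated above; adjust the path if your board image places them elsewhere.

```shell
# Prepend /opt/lib to the library search path, preserving existing entries.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH is unset.
export LD_LIBRARY_PATH="/opt/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

The setting lasts only for the current session. Add the export line to your shell profile if you want it to persist.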

Option 2: install axllm from the public repository if you prefer a system-wide binary:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Run on the Board

From the package root on the board:

chmod +x ./bin/axllm
./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/Qwen3-0.6B-AX637

Health check:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example output (one JSON object per request):

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
{
  "data": [
    {
      "id": "AXERA-TECH/Qwen3-0.6B-AX637",
      "object": "model"
    }
  ],
  "object": "list"
}
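When the server is started from a script, it can help to wait until the health endpoint reports "healthy" before sending requests. The sketch below is one possible approach. The helper names `is_healthy` and `wait_for_server`, the one-second polling interval, and the default of 60 attempts are all illustrative choices, not part of axllm.

```shell
# is_healthy JSON_BODY
# Succeeds if the body reports "status": "healthy", ignoring whitespace.
is_healthy() {
  printf '%s' "$1" | tr -d ' \n\t' | grep -q '"status":"healthy"'
}

# wait_for_server [URL] [ATTEMPTS]
# Polls the health endpoint once per second until it reports healthy.
wait_for_server() {
  url="${1:-http://127.0.0.1:8000/health}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if is_healthy "$(curl -s --max-time 2 "$url")"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server at $url did not become healthy" >&2
  return 1
}
```

A typical use would be to launch `./bin/axllm serve . --port 8000 &` and then call `wait_for_server` before issuing the first request.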

If you prefer the interactive CLI:

chmod +x ./bin/axllm
./bin/axllm run .

Text Request

The following request was validated on AX637 and returned the final answer successfully:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/Qwen3-0.6B-AX637",
    "messages": [
      {"role": "user", "content": "/no_think\nWhat is the capital of the United States? Answer with the city name only."}
    ],
    "max_tokens": 32,
    "temperature": 0
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "content": "Washington",
        "role": "assistant"
      }
    }
  ]
}

Practical notes:

  • By default, the model may emit a reasoning-style preamble before the final answer for some prompts.
  • If you want direct-answer behavior for short factual prompts, prefix the last user message with /no_think, as shown in the validated example above.
  • If you do not use /no_think, keep enough max_tokens budget for both the reasoning span and the final answer.
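To apply the /no_think prefix consistently, the request body can be generated by a small helper. The sketch below is illustrative: the function name `build_payload` is hypothetical, and it assumes the question contains no double quotes, backslashes, or newlines, since it performs no JSON escaping. The default of 32 for max_tokens mirrors the validated request above.

```shell
# build_payload QUESTION [MAX_TOKENS]
# Prints a chat-completions request body with /no_think prefixed to the
# user message. QUESTION must not contain characters that need JSON escaping.
build_payload() {
  printf '{"model":"AXERA-TECH/Qwen3-0.6B-AX637","messages":[{"role":"user","content":"/no_think\\n%s"}],"max_tokens":%d,"temperature":0}\n' \
    "$1" "${2:-32}"
}

# Example use against a running server (not executed here):
#   curl http://127.0.0.1:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(build_payload 'What is the capital of the United States? Answer with the city name only.')"
```

For prompts sent without /no_think, call the endpoint with a larger max_tokens value so the reasoning span does not crowd out the final answer.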

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Discussion
