MiniCPM-V-4.6 on AXERA NPU

Ready-to-run deployment package for openbmb/MiniCPM-V-4.6 on AX650 / NPU3.

  • This release packages the AX650 axllm runtime together with the compiled text and vision .axmodel files.
  • The packaged text runtime uses the non-GPTQ BF16 build.
  • The packaged vision runtime uses a fixed-shape 448x448 MiniCPM-V-4.6 vision encoder.
  • The package supports text-only chat, single-image understanding, and video understanding through the OpenAI-compatible axllm serve API.
  • The package also includes board-side and server-side Python reference scripts for debugging and for comparison against the original model.

Supported Platform

  • AX650 / NPU3

Validated Devices

This package has been validated on the following AX650-based device:

  • AX650 / NPU3 development board

Performance

All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token, so the multimodal rows include media preprocessing and vision encoding time.

The text-only smoke prompt was kept within one 128-token prefill chunk. To avoid one-time startup effects, the text row below excludes the first request after service startup. Its Decode figure was measured with longer text-only generations (max_tokens=256) to better reflect sustained decode throughput; very short smoke replies under-report decode speed because EOS and response-tail overhead become relatively larger. The image row was measured with the packaged fixed-shape 448x448 vision encoder and assets/sample.png. The video row used the packaged sample video with video:assets/red-panda-openai.mp4:2.

Scenario                Input tokens  Prefill chunks  TTFT                                 Decode
Text-only smoke prompt  25            1 x 128         275.88 ms avg (274.97-276.78 ms)     19.12 token/s avg
Image prompt            88            1 x 128         729.89 ms avg (723.81-741.89 ms)     19.02 token/s avg
Video prompt            1271          10 x 128        9652.87 ms avg (9585.79-9735.26 ms)  18.84 token/s avg
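
As a rough reading aid for the table above, total response time can be approximated as TTFT plus the number of generated tokens divided by the decode rate. The sketch below applies that approximation to the averages in the table. The function name is illustrative, and the assumption that decode speed stays constant for the whole reply is a simplification, not something the runtime reports.

```python
# Rough latency estimate from the averages in the performance table.
# Assumption: decode throughput stays constant for the whole reply, and the
# first token is not subtracted from new_tokens (a small overestimate).

def estimate_total_seconds(ttft_ms: float, decode_tok_per_s: float, new_tokens: int) -> float:
    """Approximate wall time: time to first token plus steady decode time."""
    return ttft_ms / 1000.0 + new_tokens / decode_tok_per_s


# (TTFT in ms, decode rate in token/s), copied from the table above.
MEASURED = {
    "text": (275.88, 19.12),
    "image": (729.89, 19.02),
    "video": (9652.87, 18.84),
}

if __name__ == "__main__":
    for name, (ttft_ms, rate) in MEASURED.items():
        seconds = estimate_total_seconds(ttft_ms, rate, new_tokens=64)
        print(f"{name}: ~{seconds:.1f} s for 64 new tokens")
```

Treat the result as an order-of-magnitude guide for choosing client timeouts rather than as a measured value.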

The packaged runtime uses the following context layout:

  • prefill_len=128
  • kv_cache_len=2047
  • prefill_max_token_num=1280

Input tokens in the table above refers to the full request length after chat templating, not just the visual soft tokens. For the shipped 448x448 vision encoder, each selected image block contributes 64 visual soft tokens. Under the current packaged runtime settings, the sample video request in this README uses 1271 total input tokens and spans 10 prefill chunks.
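
The chunk counts in the performance table follow from dividing the templated input length by prefill_len and rounding up. A minimal sketch of that arithmetic, using the packaged settings listed above, is shown here. The helper names are illustrative, and fits_prefill only compares against the configured limit; how axllm reacts to a longer request is not covered in this README.

```python
import math

PREFILL_LEN = 128             # prefill_len from the packaged runtime
PREFILL_MAX_TOKEN_NUM = 1280  # prefill_max_token_num from the packaged runtime


def prefill_chunks(input_tokens: int, prefill_len: int = PREFILL_LEN) -> int:
    """Number of fixed-size prefill chunks needed for a templated request."""
    return math.ceil(input_tokens / prefill_len)


def fits_prefill(input_tokens: int, max_tokens: int = PREFILL_MAX_TOKEN_NUM) -> bool:
    """True if the request length does not exceed the configured prefill limit."""
    return input_tokens <= max_tokens


if __name__ == "__main__":
    # Input token counts taken from the performance table.
    for name, tokens in [("text", 25), ("image", 88), ("video", 1271)]:
        print(name, tokens, prefill_chunks(tokens), fits_prefill(tokens))
```

For the three rows in the table this gives 1, 1, and 10 chunks, and all three requests stay within the 1280-token limit.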

Startup Runtime Footprint

Item                                             Value
Flash total (text + post + vision axmodels)      1.42 GiB (1458.81 MiB)
Package flash total (excluding vision_cache/)    1.93 GiB (1979.79 MiB)
Runtime CMM increment during board-side startup  1.53 GiB (1564.55 MiB)

The runtime CMM value above was measured during board-side startup on the validated AX650 board configuration and should be treated as a practical reference value.

Vision Encoder Latency

Measured on AX650 / NPU3 with /opt/bin/ax_run_model -m minicpmv4_6_vision_448.axmodel -g 0 -w 1 -r 5.

Model                           Resolution  Soft Tokens  Time (ms)
minicpmv4_6_vision_448.axmodel  448x448     64           234.827 (avg)

For this packaged AX650 runtime, the visual token count is fixed by the shipped vision encoder configuration:

  • vision_width = 448
  • vision_height = 448
  • vision_patch_size = 14
  • patch grid = (448 / 14) x (448 / 14) = 32 x 32
  • raw patch tokens = 32 x 32 = 1024
  • current packaged build uses the 16x visual compression path
  • Soft Tokens = 1024 / 16 = 64

So, for the fixed-shape runtime shipped in this repository, the relation is:

Soft Tokens = (vision_width / patch_size) x (vision_height / patch_size) / 16

Input tokens in the performance table can be larger than the visual Soft Tokens because axllm counts the full templated request, including user text and chat-template tokens in addition to the visual tokens. For the packaged assets/sample.png request in this README, the runtime reports input_num_token=88, which still fits within a single 128-token prefill chunk.

Soft Tokens is not a runtime-configurable value in this package. This repository ships only minicpmv4_6_vision_448.axmodel, so the board-side AX650 runtime always uses 448x448 -> 64 soft tokens for image encoding.
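
The Soft Tokens relation above can be checked with a few lines of arithmetic. The function below is only an illustration of the formula with the shipped values as defaults. It does not imply that other resolutions are supported by this package, and the function name is hypothetical.

```python
def soft_tokens(width: int = 448, height: int = 448,
                patch_size: int = 14, compression: int = 16) -> int:
    """Soft tokens = (width / patch) x (height / patch) / compression."""
    if width % patch_size or height % patch_size:
        raise ValueError("width and height must be multiples of patch_size")
    raw_patch_tokens = (width // patch_size) * (height // patch_size)
    if raw_patch_tokens % compression:
        raise ValueError("raw patch tokens must be divisible by the compression factor")
    return raw_patch_tokens // compression


if __name__ == "__main__":
    # Shipped configuration: 448x448, patch size 14, 16x visual compression.
    print(soft_tokens())  # 32 x 32 = 1024 raw patch tokens, 1024 / 16 = 64
```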

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── assets/
│   ├── openai_api_demo.png
│   ├── red-panda-openai.mp4
│   └── sample.png
├── python/
│   ├── infer_axmodel.py
│   ├── infer_torch.py
│   └── minicpm_v46_tokenizer/
├── minicpmv4_6_vision_448.axmodel
├── qwen3_5_text_p128_l0_together.axmodel
├── ...
├── qwen3_5_text_p128_l23_together.axmodel
├── qwen3_5_text_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── config.json
├── post_config.json
└── minicpm_v46_tokenizer.txt

This package uses a hybrid layout: the packaged axllm runtime plus the compiled .axmodel files live at the repository root, while the Python reference scripts and the tokenizer directory used by those scripts stay under python/.
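
After downloading, a quick way to confirm the root-level runtime files are present is to compare the directory against the layout above. The sketch below does that with pathlib. It assumes the elided text-layer files are numbered contiguously from l0 to l23 (24 files, matching layers=24 in the startup log), which is inferred from the tree rather than stated explicitly, and the helper names are illustrative.

```python
from pathlib import Path

# Assumption: text-layer files run contiguously from l0 to l23.
NUM_TEXT_LAYERS = 24


def expected_files(num_layers: int = NUM_TEXT_LAYERS) -> list[str]:
    """Root-relative runtime files taken from the package layout."""
    files = [
        "bin/axllm",
        "minicpmv4_6_vision_448.axmodel",
        "qwen3_5_text_post.axmodel",
        "model.embed_tokens.weight.bfloat16.bin",
        "config.json",
        "post_config.json",
        "minicpm_v46_tokenizer.txt",
    ]
    files += [f"qwen3_5_text_p128_l{i}_together.axmodel" for i in range(num_layers)]
    return files


def missing_files(root: str = ".") -> list[str]:
    """Return the expected files that are not present under root."""
    base = Path(root)
    return [name for name in expected_files() if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("package looks complete" if not missing else f"missing: {missing}")
```

Run it from the package root. An empty result only means the files exist; it does not verify their contents.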

Sample Image

Both the axllm flow and the packaged Python examples can use the sample image: assets/sample.png

[Image: sample]

Sample Video

The package also includes a sample video for validating board-side video understanding:

  • assets/red-panda-openai.mp4

Direct Inference with axllm

The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/MiniCPM-V-4.6
cd AXERA-TECH/MiniCPM-V-4.6
hf download AXERA-TECH/MiniCPM-V-4.6 --local-dir .

Install axllm

Option 1: use the validated binary included in this repository:

chmod +x ./bin/axllm

Option 2: install axllm from the public repository:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 3: install with a one-line command:

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 4: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:

https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm

Then run:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Run on the Board

The package root is already arranged for axllm, so no extra runtime path arguments are required.

For multimodal testing, you can use the packaged sample image shown above: ./assets/sample.png, or the packaged sample video: ./assets/red-panda-openai.mp4.

./bin/axllm run .

In interactive mode:

  • press Enter directly for text-only chat
  • input an image path for single-image chat
  • input video:/path/to/frames_dir or video:/path/to/video.mp4 for video chat

Serve with axllm

From the package root on the board:

./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047

Health check:

curl http://127.0.0.1:8000/health

A typical startup log looks like this:

INF Init | LLM init start
INF Init | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
INF Init | attention config: layers=24 sliding=0 full=6 linear=18 sliding_window=0 ref_full_layer_idx=3
tokenizer_type = 3
huggingface tokenizer mode = gpt2_byte_bpe
...
INF Init | max_token_len : 2047
INF Init | kv_cache_size : 512, kv_cache_num: 2047
INF init_groups_from_model | prefill_token_num : 128
INF init_groups_from_model | prefill_max_token_num : 1280
INF Init | MiniCPM-V-4.6 token ids: image_pad=248056 video_pad=248057
INF Init | VisionModule init ok: type=MiniCPMV46VL, tokens_per_block=64, embed_size=1024, out_dtype=fp32
INF Init | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047

You can then send requests to the server using the API endpoints shown in the log. For example, to check the health status and list the available models:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example output:

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
{
  "data": [
    {
      "created": 1780908633,
      "id": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
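
The same two checks can be scripted, for example to wait for the service before sending requests. The sketch below uses only the Python standard library and the two GET endpoints listed in the startup log. The helper names are illustrative, and the field names are taken from the example output above.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET a JSON document from the axllm serve endpoint."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def is_healthy(health: dict) -> bool:
    """True if the /health body reports status "healthy"."""
    return health.get("status") == "healthy"


def model_ids(models: dict) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    return [entry["id"] for entry in models.get("data", [])]
```

With the server running, is_healthy(get_json("/health")) should return True and model_ids(get_json("/v1/models")) should list the model id shown above.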

[Image: openai_api_demo]

Text Request

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is 1+1? Reply with the number only."}
        ]
      }
    ],
    "max_tokens": 32
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "1+1 is 2."
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
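
For scripted use, the text request above can also be sent from Python with the standard library. This is a minimal sketch that mirrors the curl payload and reads the reply from the response shape shown in the example output. The helper names and the 60-second timeout are illustrative choices.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047"
CHAT_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_text_payload(prompt: str, max_tokens: int = 32) -> dict:
    """Build the same request body as the curl example."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "max_tokens": max_tokens,
    }


def reply_text(response: dict) -> str:
    """Return the assistant text from a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def chat(payload: dict, url: str = CHAT_URL, timeout: float = 60.0) -> dict:
    """POST the payload to axllm serve and decode the JSON response."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())
```

With the server running, reply_text(chat(build_text_payload("What is 1+1? Reply with the number only."))) returns the assistant message as a string.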

Image Request

python3 - <<'PY'
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

img = Path("assets/sample.png").read_bytes()
payload = {
    "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64," + base64.b64encode(img).decode()
                    },
                },
            ],
        }
    ],
    "max_tokens": 64,
}
req = Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req, timeout=60) as resp:
    print(resp.read().decode())
PY

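Example output. The model answered in Chinese in this run; in English, the reply reads roughly: "Okay, this image shows a stylized red cartoon lobster. It has an exaggerated expression and a dynamic pose, which makes it look very lively and powerful. Its limbs suggest it is getting ready to strike or show off its strength, and the overall design is full of motion and fun. The character may be used for decoration or to symbolize vitality and strength."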

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "好的,这张图片是一个风格化的红色龙虾卡通形象。它有着夸张的表情和动态的姿势,显得非常活泼和有力。龙虾的肢体姿态显示出它正在准备出击或展示它的力量,整体设计充满了动感和趣味性。这个形象可能用于装饰或象征某种活力和力量。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

Video Request

axllm serve accepts either a frames directory or a raw video file:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "video:/path/to/frames_dir"}},
          {"type": "text", "text": "Describe this video."}
        ]
      }
    ],
    "max_tokens": 256
  }'

For a raw video file, use video:/path/to/video.mp4. If you need to request a specific sampling FPS, use the form video:/path/to/video.mp4:2.

To test the packaged sample video from the package root, you can set:

VIDEO_PATH="$(pwd)/assets/red-panda-openai.mp4"

and then use video:${VIDEO_PATH}:2 in the request payload.

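Example output. The model answered in Chinese in this run; in English, the reply reads roughly: "Okay, here is a detailed description of the video content: two red pandas are moving around a climbing frame built from bamboo poles. One red panda is lying on a bamboo pole with its body stretched out and its tail hanging naturally; the other is crouching below and looking up, apparently trying to climb or explore the bamboo structure. The background is a green fence and grass, and the whole scene shows their lively, curious interaction."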

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "好的,这是当前对视频内容的详细描述:画面中,两只红熊猫正在一个由竹竿搭建的攀爬架周围活动。一只红熊猫正趴在竹竿上,身体伸展,尾巴自然垂落;另一只红熊猫则蹲在下方,抬头向上,似乎正在尝试攀爬或探索竹竿结构。背景是绿色的围栏和草地,整个场景展现了它们活泼、好奇的互动状态。"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
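
The video: URL forms described above (frames directory, raw file, raw file with a sampling FPS suffix) are easy to get wrong by hand. The sketch below builds them and the matching request body. The helper names are illustrative, and the FPS is kept as an integer because the README only shows the :2 form; whether fractional values are accepted is not stated.

```python
import json
from pathlib import Path
from typing import Optional

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047"


def video_url(path: str, fps: Optional[int] = None) -> str:
    """Build video:<path> or video:<path>:<fps> as used in the request."""
    url = f"video:{path}"
    return url if fps is None else f"{url}:{fps}"


def build_video_payload(path: str, prompt: str,
                        fps: Optional[int] = None, max_tokens: int = 256) -> dict:
    """Build the same request body as the curl example for a video input."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": video_url(path, fps)}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Absolute path to the packaged sample video, resolved from the package root.
    sample = str(Path("assets/red-panda-openai.mp4").resolve())
    print(json.dumps(build_video_payload(sample, "Describe this video.", fps=2), indent=2))
```

The printed JSON can be passed to curl with -d or posted from Python. Given the video TTFT in the performance table, a client timeout well above 10 seconds is advisable.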

Browser UI with lite_webui

If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.

Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047.

Python Runtime Requirements

Install the following packages before using the packaged Python reference scripts:

  • Board-side infer_axmodel.py: pyaxengine, transformers, numpy, ml_dtypes
  • Server-side infer_torch.py: torch, transformers

The packaged Python scripts are reference utilities rather than the main runtime path.
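
Before running either script, it can help to confirm that the required packages are importable. The sketch below checks this without importing them. It assumes that pyaxengine is imported under the module name axengine; the other module names match their package names. The helper name is illustrative.

```python
from importlib.util import find_spec

# Assumption: the pyaxengine package is imported as "axengine".
BOARD_MODULES = ["axengine", "transformers", "numpy", "ml_dtypes"]  # infer_axmodel.py
SERVER_MODULES = ["torch", "transformers"]                          # infer_torch.py


def missing_modules(modules: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found on this interpreter."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    print("board-side missing:", missing_modules(BOARD_MODULES))
    print("server-side missing:", missing_modules(SERVER_MODULES))
```

Only the list that matches the machine you are on needs to come back empty.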

Legacy Python Demo Flow

Text-Only Inference

python/infer_axmodel.py is intended for board-side text debugging of the packaged runtime files:

cd python
python3 infer_axmodel.py \
  --hf-model ./minicpm_v46_tokenizer \
  --axmodel-dir .. \
  --mode generate \
  --prompt "What is 1+1? Reply with the number only." \
  --prompt-mode prefill \
  --max-new-tokens 16 \
  --kv-cache-len 2047

Hugging Face Reference Inference

python/infer_torch.py is intended for x86 or GPU-side comparison against the original Hugging Face model:

cd python
python infer_torch.py \
  --model-path /path/to/original/MiniCPM-V-4.6 \
  --prompt "Please give a short self introduction."

Packaged Python Runtime Paths

The packaged Python helper paths are:

  • python/infer_axmodel.py
  • python/infer_torch.py
  • python/minicpm_v46_tokenizer/

The packaged axllm runtime does not depend on python/minicpm_v46_tokenizer/, but python/infer_axmodel.py uses it by default.

These path arguments apply to the Python demo flow only. The axllm flow reads the same root-level runtime files packaged in this repository.

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Discussion

  • GitHub Issues
  • QQ group: 139953715