MiniCPM-V-4.6-GPTQ on AXERA NPU

Ready-to-run deployment package for openbmb/MiniCPM-V-4.6-GPTQ on AX650 / NPU3.

  • This release packages the AX650 axllm runtime together with the compiled text and vision .axmodel files.
  • The packaged text runtime uses the GPTQ INT4 build.
  • The packaged vision runtime uses a fixed-shape 448x448 MiniCPM-V-4.6 vision encoder.
  • The package supports text-only chat, single-image understanding, and video understanding through the OpenAI-compatible axllm serve API.
  • The package includes sample assets for image and video validation.

Supported Platform

  • AX650 / NPU3

Validated Devices

This package has been validated on the following AX650-based device:

  • AX650 / NPU3 development board

Performance

All measurements below were taken on AX650 / NPU3. TTFT (time to first token) is measured end-to-end, from request arrival at axllm serve to the first generated token, so the multimodal rows include media preprocessing and vision encoding time.

The text-only smoke prompt was kept within one 128-token prefill chunk. To avoid one-time startup effects, the text row below excludes the first request after service startup. Its Decode figure was measured separately with longer text-only generations (max_tokens=256) to reflect sustained decode throughput, because the short smoke reply used for the TTFT measurement is effectively a single-token answer and would under-report decode speed. The image row was measured with the packaged fixed-shape 448x448 vision encoder and assets/sample.png. The video row used the packaged sample video, sampled at 2 FPS with video:assets/red-panda-openai.mp4:2.

| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
| --- | --- | --- | --- | --- |
| Text-only smoke prompt | 25 | 1 x 128 | 260.81 ms avg (259.01-262.61 ms) | 24.07 token/s avg |
| Image prompt | 88 | 1 x 128 | 719.79 ms avg (708.57-732.47 ms) | 24.49 token/s avg |
| Video prompt | 1271 | 10 x 128 | 9555.33 ms avg (9484.00-9647.62 ms) | 23.87 token/s avg |
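
As a rough way to read the table, the time to receive a full reply can be approximated as TTFT plus the remaining tokens at the decode rate. The sketch below assumes that simple model (it ignores network time and any change in decode rate as the KV cache fills); the function name is hypothetical and the numbers are the averages copied from the table above.

```python
def estimate_reply_ms(ttft_ms: float, decode_tok_per_s: float, n_tokens: int) -> float:
    """Approximate time to receive n_tokens generated tokens.

    Assumption: the first token arrives at TTFT and every later token
    arrives at the steady decode rate reported in the table.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_ms + (n_tokens - 1) / decode_tok_per_s * 1000.0


# Average (TTFT in ms, decode rate in token/s) copied from the performance table.
TABLE = {
    "text": (260.81, 24.07),
    "image": (719.79, 24.49),
    "video": (9555.33, 23.87),
}

for name, (ttft_ms, rate) in TABLE.items():
    seconds = estimate_reply_ms(ttft_ms, rate, 64) / 1000.0
    print(f"{name}: ~{seconds:.1f} s for a 64-token reply")
```

For short replies the video scenario is dominated by TTFT, while for long replies the decode rate dominates in all three scenarios.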

The packaged runtime uses the following context layout:

  • prefill_len=128
  • kv_cache_len=2047
  • prefill_max_token_num=1280

Input tokens in the table above refers to the full request length after chat templating, not just the visual soft tokens. For the shipped 448x448 vision encoder, each selected image block contributes 64 visual soft tokens. Under the current packaged runtime settings, the sample video request in this README uses 1271 total input tokens and spans 10 prefill chunks.
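
The chunk counts in the table follow from dividing the templated request length by prefill_len and rounding up. The sketch below reproduces that arithmetic; the helper names are hypothetical, and treating prefill_max_token_num as the upper bound on input tokens is an assumption rather than documented axllm behavior.

```python
import math

PREFILL_LEN = 128             # prefill_len from the context layout
PREFILL_MAX_TOKEN_NUM = 1280  # prefill_max_token_num from the context layout


def prefill_chunks(input_tokens: int, prefill_len: int = PREFILL_LEN) -> int:
    """Number of fixed-size prefill chunks needed for a templated request."""
    return math.ceil(input_tokens / prefill_len)


def fits_prefill(input_tokens: int, limit: int = PREFILL_MAX_TOKEN_NUM) -> bool:
    """Assumption: a request must fit within prefill_max_token_num tokens."""
    return input_tokens <= limit


# Input token counts taken from the performance table.
for name, tokens in [("text", 25), ("image", 88), ("video", 1271)]:
    print(name, tokens, prefill_chunks(tokens), fits_prefill(tokens))
```

Under that assumption, the 1271-token sample video request sits just below the 1280-token limit, so longer clips or higher sampling rates may not fit.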

Startup Runtime Footprint

| Item | Value |
| --- | --- |
| Flash total (text + post + vision axmodels) | 1.19 GiB (1214.38 MiB) |
| Package flash total (current repository layout, excluding runtime-generated vision_cache/) | 1.68 GiB (1719.30 MiB) |
| Runtime CMM increment during board-side startup | 1.30 GiB (1334.05 MiB) |

The runtime CMM value above was measured during board-side startup on a shared AX650 system and should be treated as a practical reference value.
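
To re-check the flash figure on a local copy of the package, one option is to sum the sizes of the .axmodel files in the package root. This is a minimal sketch with hypothetical helper names; it assumes that the "text + post + vision axmodels" row corresponds to every *.axmodel file directly under the root, which this README does not state explicitly.

```python
from pathlib import Path

MIB = 1024 ** 2
GIB = 1024 ** 3


def axmodel_flash_bytes(package_root: str) -> int:
    """Sum the sizes of the *.axmodel files directly under package_root."""
    return sum(p.stat().st_size for p in Path(package_root).glob("*.axmodel"))


def format_size(num_bytes: int) -> str:
    """Render a byte count in the same 'GiB (MiB)' style as the table."""
    return f"{num_bytes / GIB:.2f} GiB ({num_bytes / MIB:.2f} MiB)"
```

From the package root, print(format_size(axmodel_flash_bytes("."))) prints the total in the table's format.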

Vision Encoder Latency

Measured on AX650 / NPU3 with /opt/bin/ax_run_model -m minicpmv4_6_vision_448.axmodel -g 0 -w 1 -r 5.

| Model | Resolution | Soft Tokens | Time (ms) |
| --- | --- | --- | --- |
| minicpmv4_6_vision_448.axmodel | 448x448 | 64 | 235.285 (avg) |

For this packaged AX650 runtime, the visual token count is fixed by the shipped vision encoder configuration:

  • vision_width = 448
  • vision_height = 448
  • vision_patch_size = 14
  • patch grid = (448 / 14) x (448 / 14) = 32 x 32
  • raw patch tokens = 32 x 32 = 1024
  • current packaged build uses the 16x visual compression path
  • Soft Tokens = 1024 / 16 = 64

So, for the fixed-shape runtime shipped in this repository, the relation is:

Soft Tokens = (vision_width / patch_size) x (vision_height / patch_size) / 16

Input tokens in the performance table can be larger than the visual Soft Tokens because axllm counts the full templated request, including user text and chat-template tokens in addition to the visual tokens. For the packaged assets/sample.png request in this README, the runtime reports input_num_token=88, which still fits within a single 128-token prefill chunk.

Soft Tokens is not a runtime-configurable value in this package. This repository ships only minicpmv4_6_vision_448.axmodel, so the board-side AX650 runtime always uses 448x448 -> 64 soft tokens for image encoding.
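
The relation above can be checked with a few lines of arithmetic. The function below is an illustrative sketch (its name and defaults are not part of axllm); the last line assumes the sample image is encoded as a single 448x448 block, in which case the remaining tokens in the 88-token request are user text and chat-template tokens.

```python
def soft_tokens(width: int, height: int, patch_size: int = 14, compression: int = 16) -> int:
    """Visual soft tokens per image block for a fixed-shape vision encoder."""
    if width % patch_size or height % patch_size:
        raise ValueError("width and height must be multiples of patch_size")
    raw_patches = (width // patch_size) * (height // patch_size)
    if raw_patches % compression:
        raise ValueError("raw patch count must be divisible by the compression factor")
    return raw_patches // compression


visual = soft_tokens(448, 448)  # 32 x 32 = 1024 raw patches, 1024 / 16 = 64
print(visual)

# Assumption: one image block in the sample request (input_num_token=88).
print(88 - visual)  # tokens left for user text and chat template
```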

Package Layout

.
├── README.md
├── bin/
│   ├── axllm
│   └── axllm.version.json
├── assets/
│   ├── openai_api_demo.png
│   ├── red-panda-openai.mp4
│   └── sample.png
├── minicpmv4_6_vision_448.axmodel
├── qwen3_5_text_p128_l0_together.axmodel
├── ...
├── qwen3_5_text_p128_l23_together.axmodel
├── qwen3_5_text_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── config.json
├── post_config.json
└── minicpm_v46_tokenizer.txt

This package keeps the runtime files at the repository root so it can be served directly by axllm.
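
After downloading, a quick completeness check of the package root can catch a partial download before axllm is started. The sketch below uses hypothetical helper names and assumes the elided entries in the tree above are the text layers l1 through l22 with the same naming pattern as l0 and l23 (24 layers in total, matching layers=24 in the startup log).

```python
from pathlib import Path

NUM_TEXT_LAYERS = 24  # l0 .. l23


def expected_runtime_files() -> list:
    """Runtime files the package root is expected to contain."""
    files = [
        "minicpmv4_6_vision_448.axmodel",
        "qwen3_5_text_post.axmodel",
        "model.embed_tokens.weight.bfloat16.bin",
        "config.json",
        "post_config.json",
        "minicpm_v46_tokenizer.txt",
    ]
    # Assumption: the elided layer files follow the l0 / l23 naming pattern.
    files += [f"qwen3_5_text_p128_l{i}_together.axmodel" for i in range(NUM_TEXT_LAYERS)]
    return files


def missing_runtime_files(package_root: str) -> list:
    """Return the expected files that are absent under package_root."""
    root = Path(package_root)
    return [name for name in expected_runtime_files() if not (root / name).is_file()]
```

Calling missing_runtime_files(".") from the package root returns an empty list when every expected file is present.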

Sample Image

Both the axllm flow and the packaged sample requests can use the sample image: assets/sample.png

Sample Video

The package also includes a sample video for board-side video understanding validation:

  • assets/red-panda-openai.mp4

Direct Inference with axllm

The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/MiniCPM-V-4.6-GPTQ
cd AXERA-TECH/MiniCPM-V-4.6-GPTQ
hf download AXERA-TECH/MiniCPM-V-4.6-GPTQ --local-dir .

Install axllm

Option 1: use the validated binary included in this repository:

chmod +x ./bin/axllm

Option 2: install axllm from the public repository:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 3: install with a one-line command:

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 4: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions (https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm), then run:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Run on the Board

The package root is already arranged for axllm, so no extra runtime path arguments are required.

For multimodal testing, you can use the packaged sample image ./assets/sample.png or the packaged sample video ./assets/red-panda-openai.mp4.

./bin/axllm run .

In interactive mode:

  • press Enter directly for text-only chat
  • input an image path for single-image chat
  • input video:/path/to/frames_dir or video:/path/to/video.mp4 for video chat

Serve with axllm

From the package root on the board:

./bin/axllm serve . --port 8000

Expected model id:

AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047

Health check:

curl http://127.0.0.1:8000/health

A typical startup log looks like this:

INF Init | LLM init start
INF Init | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
INF Init | attention config: layers=24 sliding=0 full=6 linear=18 sliding_window=0 ref_full_layer_idx=3
tokenizer_type = 3
huggingface tokenizer mode = gpt2_byte_bpe
...
INF Init | max_token_len : 2047
INF Init | kv_cache_size : 512, kv_cache_num: 2047
INF init_groups_from_model | prefill_token_num : 128
INF init_groups_from_model | prefill_max_token_num : 1280
INF Init | MiniCPM-V-4.6 token ids: image_pad=248056 video_pad=248057
INF Init | VisionModule init ok: type=MiniCPMV46VL, tokens_per_block=64, embed_size=1024, out_dtype=fp32
INF Init | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047

You can then send requests to the server using the API endpoints shown in the log. For example, to check the health status and list the available models:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Example output:

{
  "concurrency": 0,
  "max_concurrency": 1,
  "status": "healthy"
}
{
  "data": [
    {
      "created": 1780911663,
      "id": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
      "object": "model",
      "owned_by": "openai-api"
    }
  ],
  "object": "list"
}
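
Client scripts can read the model id from /v1/models instead of hard-coding it. The sketch below parses the two responses shown above; the helper names are hypothetical, and interpreting "concurrency" as the number of in-flight requests is an assumption based only on the field names.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET an endpoint of the axllm serve API and decode the JSON body."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def is_ready(health: dict) -> bool:
    """Healthy and, assuming 'concurrency' counts in-flight requests, not busy."""
    return (
        health.get("status") == "healthy"
        and health.get("concurrency", 0) < health.get("max_concurrency", 1)
    )


def first_model_id(models: dict) -> str:
    """Return the id of the first model listed by /v1/models."""
    return models["data"][0]["id"]
```

With the server running, first_model_id(get_json("/v1/models")) returns the id to use in the "model" field of later requests.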

Text Request

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is 1+1? Reply with the number only."}
        ]
      }
    ],
    "max_tokens": 32
  }'

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "2"
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}
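
The same text request can be sent from Python with only the standard library, mirroring the curl call above. This is a sketch with hypothetical helper names; it assumes the response has the shape shown in the example output.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047"


def text_payload(prompt: str, max_tokens: int = 32, model: str = MODEL_ID) -> dict:
    """Build the same request body as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict, base_url: str = "http://127.0.0.1:8000",
              timeout: float = 60.0) -> dict:
    """POST a chat completion request and decode the JSON response."""
    req = Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def reply_text(response: dict) -> str:
    """Extract the assistant message from a chat.completion response."""
    return response["choices"][0]["message"]["content"]
```

With the server running, reply_text(post_chat(text_payload("What is 1+1? Reply with the number only."))) returns the assistant's answer as a string.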

Image Request

python3 - <<'PY'
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

img = Path("assets/sample.png").read_bytes()
payload = {
    "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64," + base64.b64encode(img).decode()
                    },
                },
            ],
        }
    ],
    "max_tokens": 64,
}
req = Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req, timeout=60) as resp:
    print(resp.read().decode())
PY

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The image shows a colorful, cartoon-style red lobster or lobster-like character with a cheerful expression, raised claws, and a dynamic, action-oriented pose."
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

Video Request

axllm serve accepts either a frames directory or a raw video file:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "video:/path/to/frames_dir"}},
          {"type": "text", "text": "Describe this video briefly."}
        ]
      }
    ],
    "max_tokens": 128
  }'

For a raw video file, use video:/path/to/video.mp4. If you need to request a specific sampling FPS, use the form video:/path/to/video.mp4:2.

To test the packaged sample video from the package root, you can set:

VIDEO_PATH="$(pwd)/assets/red-panda-openai.mp4"

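and then use video:${VIDEO_PATH}:2 in the request payload. Note that the curl example above wraps the JSON body in single quotes, so the shell does not expand ${VIDEO_PATH} inside it; either paste the absolute path into the payload or build the body in a way that allows expansion.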
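
One way to sidestep shell quoting is to build the request body in Python. The sketch below uses hypothetical helper names and assumes it runs on the board from the package root, so that the absolute path it produces is valid for axllm serve.

```python
import json
import os
from typing import Optional

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047"


def video_url(path: str, fps: Optional[int] = None) -> str:
    """Build the video:<path>[:fps] form described above from a local path."""
    url = "video:" + os.path.abspath(path)
    return url if fps is None else f"{url}:{fps}"


def video_payload(path: str, prompt: str, fps: Optional[int] = None,
                  max_tokens: int = 128) -> dict:
    """Same message layout as the curl example: video entry first, then text."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": video_url(path, fps)}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Print the JSON body so it can be saved to a file and sent with curl -d @file.
body = video_payload("assets/red-panda-openai.mp4", "Describe this video briefly.", fps=2)
print(json.dumps(body, indent=2))
```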

Example output:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The red panda is seen playing with the other red panda."
      },
      "finish_reason": "stop"
    }
  ],
  "model": "AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047",
  "object": "chat.completion"
}

Browser UI with lite_webui

If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.

Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM-V-4.6-GPTQ-AX650-C128-P1152-CTX2047.

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Discussion

  • GitHub Issues
  • QQ group: 139953715