MiniCPM-V-4.6 on AXERA NPU
Ready-to-run deployment package for openbmb/MiniCPM-V-4.6 on AX650 / NPU3.
- This release packages the AX650 axllm runtime together with the compiled text and vision .axmodel files.
- The packaged text runtime uses the non-GPTQ BF16 build.
- The packaged vision runtime uses a fixed-shape 448x448 MiniCPM-V-4.6 vision encoder.
- The package supports text-only chat, single-image understanding, and video understanding through the OpenAI-compatible axllm serve API.
- The package also includes board-side and server-side Python reference scripts for reference use and comparison.
Supported Platform
- AX650 / NPU3
Validated Devices
This package has been validated on the following AX650-based device:
- AX650 / NPU3 development board
Performance
All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token. In this table, TTFT is measured end-to-end from request arrival at axllm serve to the first generated token, so the multimodal rows include media preprocessing and vision encoding time.
The text-only smoke prompt was kept within one 128-token prefill chunk. To avoid one-time startup effects, the text row below excludes the first request after service startup. Its Decode figure was measured with longer text-only generations (max_tokens=256) to better reflect sustained decode throughput; very short smoke replies under-report decode speed because EOS and response-tail overhead become relatively larger. The image row was measured with the packaged fixed-shape 448x448 vision encoder and assets/sample.png. The video row used the packaged sample video with video:assets/red-panda-openai.mp4:2.
| Scenario | Input tokens | Prefill chunks | TTFT | Decode |
|---|---|---|---|---|
| Text-only smoke prompt | 25 | 1 x 128 | 275.88 ms avg (274.97-276.78 ms) | 19.12 token/s avg |
| Image prompt | 88 | 1 x 128 | 729.89 ms avg (723.81-741.89 ms) | 19.02 token/s avg |
| Video prompt | 1271 | 10 x 128 | 9652.87 ms avg (9585.79-9735.26 ms) | 18.84 token/s avg |
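As a back-of-envelope aid for reading the table, the sketch below combines the TTFT and Decode columns into a rough response-time estimate. It assumes that TTFT covers the first generated token and that every later token arrives at the average decode rate; the function name `estimate_latency_s` is hypothetical and not part of axllm, and real requests will deviate from this simple model.

```python
def estimate_latency_s(ttft_ms, decode_tok_per_s, new_tokens):
    """Rough estimate: TTFT pays for the first token, decode rate for the rest."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    return ttft_ms / 1000.0 + (new_tokens - 1) / decode_tok_per_s


# Average TTFT (ms) and decode rate (token/s) copied from the table above.
ROWS = {
    "text": (275.88, 19.12),
    "image": (729.89, 19.02),
    "video": (9652.87, 18.84),
}

for name, (ttft_ms, rate) in ROWS.items():
    seconds = estimate_latency_s(ttft_ms, rate, 64)
    print(f"{name}: roughly {seconds:.2f} s for 64 generated tokens")
```

The estimate is only meant for sizing timeouts and comparing scenarios; it ignores network transfer and any response-tail overhead.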
The packaged runtime uses the following context layout:
- prefill_len=128
- kv_cache_len=2047
- prefill_max_token_num=1280
Input tokens in the table above refers to the full request length after chat templating, not just the visual soft tokens. For the shipped 448x448 vision encoder, each selected image block contributes 64 visual soft tokens. Under the current packaged runtime settings, the sample video request in this README uses 1271 total input tokens and spans 10 prefill chunks.
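The chunk counts in the performance table follow from dividing the templated input length by the 128-token prefill chunk size and rounding up. The sketch below reproduces that arithmetic. The helper name `prefill_chunks` is hypothetical, and treating prefill_max_token_num as a hard upper bound that raises an error is an assumption; this README does not state how axllm handles longer requests.

```python
import math

# Context layout values quoted from this README.
PREFILL_LEN = 128
KV_CACHE_LEN = 2047
PREFILL_MAX_TOKEN_NUM = 1280


def prefill_chunks(input_tokens):
    """Number of 128-token prefill chunks needed for a templated request."""
    if input_tokens < 1:
        raise ValueError("input_tokens must be positive")
    # Assumption: requests longer than prefill_max_token_num are not accepted.
    if input_tokens > PREFILL_MAX_TOKEN_NUM:
        raise ValueError("request exceeds prefill_max_token_num")
    return math.ceil(input_tokens / PREFILL_LEN)


# Input token counts from the text, image, and video rows.
for tokens in (25, 88, 1271):
    print(tokens, "input tokens ->", prefill_chunks(tokens), "chunk(s)")
```

With these settings the largest accepted request would span 1280 / 128 = 10 chunks, which is why the 1271-token sample video sits right at the upper end.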
Startup Runtime Footprint
| Item | Value |
|---|---|
| Flash total (text + post + vision axmodels) | 1.42 GiB (1458.81 MiB) |
| Package flash total (excluding vision_cache/) | 1.93 GiB (1979.79 MiB) |
| Runtime CMM increment during board-side startup | 1.53 GiB (1564.55 MiB) |
The runtime CMM value above was measured during board-side startup on the validated AX650 board configuration and should be treated as a practical reference value.
Vision Encoder Latency
Measured on AX650 / NPU3 with /opt/bin/ax_run_model -m minicpmv4_6_vision_448.axmodel -g 0 -w 1 -r 5.
| Model | Resolution | Soft Tokens | Time (ms) |
|---|---|---|---|
| minicpmv4_6_vision_448.axmodel | 448x448 | 64 | 234.827 ms avg |
For this packaged AX650 runtime, the visual token count is fixed by the shipped vision encoder configuration:
- vision_width = 448
- vision_height = 448
- vision_patch_size = 14
- patch grid = (448 / 14) x (448 / 14) = 32 x 32
- raw patch tokens = 32 x 32 = 1024
- current packaged build uses the 16x visual compression path
- Soft Tokens = 1024 / 16 = 64
So, for the fixed-shape runtime shipped in this repository, the relation is:
Soft Tokens = (vision_width / patch_size) x (vision_height / patch_size) / 16
Input tokens in the performance table can be larger than the visual Soft Tokens because axllm counts the full templated request, including user text and chat-template tokens in addition to the visual tokens. For the packaged assets/sample.png request in this README, the runtime reports input_num_token=88, which still fits within a single 128-token prefill chunk.
Soft Tokens is not a runtime-configurable value in this package. This repository ships only minicpmv4_6_vision_448.axmodel, so the board-side AX650 runtime always uses 448x448 -> 64 soft tokens for image encoding.
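The Soft Tokens relation above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `soft_tokens` and its parameters are hypothetical, and the package itself does not expose such a setting, since the shipped encoder is fixed at 448x448.

```python
def soft_tokens(width, height, patch_size=14, compression=16):
    """Soft tokens = (width / patch) x (height / patch) / compression."""
    if width % patch_size or height % patch_size:
        raise ValueError("resolution must be a multiple of the patch size")
    raw_patch_tokens = (width // patch_size) * (height // patch_size)
    if raw_patch_tokens % compression:
        raise ValueError("raw patch tokens must divide evenly by the compression factor")
    return raw_patch_tokens // compression


# Shipped configuration: 448x448 input, 14-pixel patches, 16x compression.
print("soft tokens per image block:", soft_tokens(448, 448))
```

For the shipped configuration this evaluates to 32 x 32 = 1024 raw patch tokens and 1024 / 16 = 64 soft tokens, matching the table.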
Package Layout
.
├── README.md
├── bin/
│ ├── axllm
│ └── axllm.version.json
├── assets/
│ ├── openai_api_demo.png
│ ├── red-panda-openai.mp4
│ └── sample.png
├── python/
│ ├── infer_axmodel.py
│ ├── infer_torch.py
│ └── minicpm_v46_tokenizer/
├── minicpmv4_6_vision_448.axmodel
├── qwen3_5_text_p128_l0_together.axmodel
├── ...
├── qwen3_5_text_p128_l23_together.axmodel
├── qwen3_5_text_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
├── config.json
├── post_config.json
└── minicpm_v46_tokenizer.txt
This package uses a hybrid layout: the packaged axllm runtime plus the compiled .axmodel files live at the repository root, while the Python reference scripts and the tokenizer directory used by those scripts stay under python/.
Sample Image
Both the axllm flow and the packaged Python examples can use the sample image:
assets/sample.png
Sample Video
The package also includes a sample video for board-side video understanding validation:
assets/red-panda-openai.mp4
Direct Inference with axllm
The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.
Download the Model Package
Download the release package from Hugging Face:
mkdir -p AXERA-TECH/MiniCPM-V-4.6
cd AXERA-TECH/MiniCPM-V-4.6
hf download AXERA-TECH/MiniCPM-V-4.6 --local-dir .
Install axllm
Option 1: use the validated binary included in this repository:
chmod +x ./bin/axllm
Option 2: install axllm from the public repository:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 3: install with a one-line command:
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 4: download the prebuilt binary from GitHub Actions CI:
If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Then run:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Run on the Board
The package root is already arranged for axllm, so no extra runtime path arguments are required.
For multimodal testing, you can use the packaged sample image shown above: ./assets/sample.png, or the packaged sample video: ./assets/red-panda-openai.mp4.
./bin/axllm run .
In interactive mode:
- press Enter directly for text-only chat
- input an image path for single-image chat
- input video:/path/to/frames_dir or video:/path/to/video.mp4 for video chat
Serve with axllm
From the package root on the board:
./bin/axllm serve . --port 8000
Expected model id:
AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047
Health check:
curl http://127.0.0.1:8000/health
A typical startup log looks like this:
INF Init | LLM init start
INF Init | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
INF Init | attention config: layers=24 sliding=0 full=6 linear=18 sliding_window=0 ref_full_layer_idx=3
tokenizer_type = 3
huggingface tokenizer mode = gpt2_byte_bpe
...
INF Init | max_token_len : 2047
INF Init | kv_cache_size : 512, kv_cache_num: 2047
INF init_groups_from_model | prefill_token_num : 128
INF init_groups_from_model | prefill_max_token_num : 1280
INF Init | MiniCPM-V-4.6 token ids: image_pad=248056 video_pad=248057
INF Init | VisionModule init ok: type=MiniCPMV46VL, tokens_per_block=64, embed_size=1024, out_dtype=fp32
INF Init | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047'...
API URLs:
GET http://127.0.0.1:8000/health
GET http://127.0.0.1:8000/v1/models
POST http://127.0.0.1:8000/v1/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047
You can then send requests to the server using the API endpoints shown in the log. For example, to check the health status and list the available models:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Example output:
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
{
"data": [
{
"created": 1780908633,
"id": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"object": "model",
"owned_by": "openai-api"
}
],
"object": "list"
}
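The same two checks can be scripted before sending real requests. The sketch below uses only the Python standard library. The helper names are hypothetical, the base URL assumes the server was started with --port 8000 on the same host, and reading concurrency as the number of in-flight requests is an assumption based on the field names in the example output.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # assumption: local server on port 8000


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """GET a JSON document from the server (needs a running axllm serve)."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def is_healthy(health):
    """True when the /health payload reports status == "healthy"."""
    return health.get("status") == "healthy"


def has_free_slot(health):
    """Assumption: 'concurrency' counts in-flight requests against 'max_concurrency'."""
    return health.get("concurrency", 0) < health.get("max_concurrency", 1)


def model_ids(models):
    """Collect the 'id' field of every entry in a /v1/models payload."""
    return [entry["id"] for entry in models.get("data", [])]


# Offline check against the example output shown above (no server needed).
sample_health = {"concurrency": 0, "max_concurrency": 1, "status": "healthy"}
sample_models = {
    "data": [
        {
            "created": 1780908633,
            "id": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
            "object": "model",
            "owned_by": "openai-api",
        }
    ],
    "object": "list",
}
print(is_healthy(sample_health), has_free_slot(sample_health))
print(model_ids(sample_models))
```

Against a live server, `fetch_json("/health")` and `fetch_json("/v1/models")` would supply the two payloads instead of the hard-coded samples.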
Text Request
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is 1+1? Reply with the number only."}
]
}
],
"max_tokens": 32
}'
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "1+1 is 2."
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
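For scripted use, the text request above can be expressed as a small Python client. This is a minimal sketch using the standard library only. The helper names `build_text_payload`, `post_chat`, and `extract_reply` are hypothetical, and the response handling assumes the fields shown in the example output above.

```python
import json
from urllib.request import Request, urlopen

MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047"
CHAT_URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumption: local server, port 8000


def build_text_payload(prompt, max_tokens=32, model=MODEL_ID):
    """Build the same request body as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "max_tokens": max_tokens,
    }


def post_chat(payload, url=CHAT_URL, timeout=60):
    """POST the payload as JSON and decode the reply (needs a running server)."""
    req = Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def extract_reply(response):
    """Return (assistant text, finish_reason) from a chat.completion object."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice.get("finish_reason")


# Offline check against the example output shown above (no server needed).
sample_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "1+1 is 2."},
            "finish_reason": "stop",
        }
    ],
    "model": MODEL_ID,
    "object": "chat.completion",
}
print(extract_reply(sample_response))
```

With the server running, `extract_reply(post_chat(build_text_payload("What is 1+1? Reply with the number only.")))` would perform the same round trip as the curl command.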
Image Request
python3 - <<'PY'
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

img = Path("assets/sample.png").read_bytes()
payload = {
    "model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64," + base64.b64encode(img).decode()
                    },
                },
            ],
        }
    ],
    "max_tokens": 64,
}
req = Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req, timeout=60) as resp:
    print(resp.read().decode())
PY
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "好的,这张图片是一个风格化的红色龙虾卡通形象。它有着夸张的表情和动态的姿势,显得非常活泼和有力。龙虾的肢体姿态显示出它正在准备出击或展示它的力量,整体设计充满了动感和趣味性。这个形象可能用于装饰或象征某种活力和力量。"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
Video Request
axllm serve accepts either a frames directory or a raw video file:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "video:/path/to/frames_dir"}},
{"type": "text", "text": "Describe this video."}
]
}
],
"max_tokens": 256
}'
For a raw video file, use video:/path/to/video.mp4. If you need to request a specific sampling FPS, use the form video:/path/to/video.mp4:2.
To test the packaged sample video from the package root, you can set:
VIDEO_PATH="$(pwd)/assets/red-panda-openai.mp4"
and then use video:${VIDEO_PATH}:2 in the request payload.
Example output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "好的,这是当前对视频内容的详细描述:画面中,两只红熊猫正在一个由竹竿搭建的攀爬架周围活动。一只红熊猫正趴在竹竿上,身体伸展,尾巴自然垂落;另一只红熊猫则蹲在下方,抬头向上,似乎正在尝试攀爬或探索竹竿结构。背景是绿色的围栏和草地,整个场景展现了它们活泼、好奇的互动状态。"
},
"finish_reason": "stop"
}
],
"model": "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047",
"object": "chat.completion"
}
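The video: URL forms described in this section (frames directory, raw video file, and raw video file with a sampling FPS suffix) can also be assembled programmatically. The sketch below only builds the request body and does not contact the server. The helper names `video_url` and `build_video_payload` are hypothetical, and the example path /data/red-panda-openai.mp4 is a placeholder rather than a path from this package.

```python
MODEL_ID = "AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047"


def video_url(path, fps=None):
    """Build 'video:<path>' or 'video:<path>:<fps>' as described in this section."""
    url = f"video:{path}"
    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        url += f":{fps}"
    return url


def build_video_payload(path, prompt, fps=None, max_tokens=256, model=MODEL_ID):
    """Mirror the curl example: the video goes in an image_url content part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": video_url(path, fps)}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Placeholder path for illustration; substitute the real location on the board.
payload = build_video_payload("/data/red-panda-openai.mp4", "Describe this video.", fps=2)
print(payload["messages"][0]["content"][0]["image_url"]["url"])
```

Because the URL carries a filesystem path rather than inline data, the path presumably has to be readable by the process running axllm serve, not by the client sending the request.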
Browser UI with lite_webui
If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.
Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/MiniCPM-V-4.6-AX650-C128-P1152-CTX2047.
Python Runtime Requirements
Install the following packages before using the packaged Python reference scripts:
- Board-side infer_axmodel.py: pyaxengine, transformers, numpy, ml_dtypes
- Server-side infer_torch.py: torch, transformers
The packaged Python scripts are reference utilities rather than the main runtime path.
Legacy Python Demo Flow
Text-Only Inference
python/infer_axmodel.py is intended for board-side text debugging of the packaged runtime files:
cd python
python3 infer_axmodel.py \
--hf-model ./minicpm_v46_tokenizer \
--axmodel-dir .. \
--mode generate \
--prompt "What is 1+1? Reply with the number only." \
--prompt-mode prefill \
--max-new-tokens 16 \
--kv-cache-len 2047
Hugging Face Reference Inference
python/infer_torch.py is intended for x86 or GPU-side comparison against the original Hugging Face model:
cd python
python infer_torch.py \
--model-path /path/to/original/MiniCPM-V-4.6 \
--prompt "Please give a short self introduction."
Packaged Python Runtime Paths
The packaged Python helper paths are:
- python/infer_axmodel.py
- python/infer_torch.py
- python/minicpm_v46_tokenizer/
The packaged axllm runtime does not depend on python/minicpm_v46_tokenizer/, but python/infer_axmodel.py uses it by default.
These path arguments apply to the Python demo flow only. The axllm flow reads the same root-level runtime files packaged in this repository.
Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: openbmb/MiniCPM-V-4.6
- AXERA conversion and deployment workflow: AXERA-TECH/MiniCPM-V-4.6.axera
Discussion
- GitHub Issues
- QQ group: 139953715
