error loading model hyperparameters: invalid n_rot: 128, expected 80

by ziluo - opened Apr 21

./llama-minicpmv-cli -m ../ggml-model-Q4_0.gguf --mmproj ../mmproj-model-f16.gguf -c 4096 -ngl 100 --temp 0.1 --top-p 0.8 --top-k 100 --repeat-penalty 1.1 --image /work/zhouqingsong/bigmodel/image_408770077684281.png -p "What is the current state of the traffic light in the image?"
Log start
clip_model_load: description: image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 455
clip_model_load: n_kv: 19
clip_model_load: ftype: f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ../mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_minicpmv_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for MiniCPM-V
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.minicpmv_version i32 = 3
clip_model_load: - kv 8: clip.vision.image_size u32 = 448
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 27
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 18: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 170 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 1
clip_model_load: model size: 996.02 MB
clip_model_load: metadata size: 0.16 MB
clip_model_load: params backend buffer size = 996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
uhd_slice_image: multiple 6
uhd_slice_image: image_size: 1365 768; source_image size: 602 336
uhd_slice_image: image_size: 1365 768; best_grid: 3 2
uhd_slice_image: refine_image_size: 1470 812; refine_size: 1470 812
clip_image_preprocess: 602 336
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 7 encoded in 963.86 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 2 of 7 encoded in 638.17 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 3 of 7 encoded in 629.01 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 4 of 7 encoded in 590.18 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 5 of 7 encoded in 594.84 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 6 of 7 encoded in 611.90 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 7 of 7 encoded in 612.10 ms
encode_image_with_clip: all 7 segments encoded in 4643.62 ms
encode_image_with_clip: load_image_size 1365 768
encode_image_with_clip: image embedding created: 448 tokens

encode_image_with_clip: image encoded in 4647.04 ms by CLIP ( 10.37 ms per image patch)
llama_model_loader: loaded meta data with 32 key-value pairs and 291 tensors from ../ggml-model-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 3.6B
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.context_length u32 = 32768
llama_model_loader: - kv 6: llama.embedding_length u32 = 2560
llama_model_loader: - kv 7: llama.feed_forward_length u32 = 10240
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 2
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: llama.attention.key_length u32 = 128
llama_model_loader: - kv 13: llama.attention.value_length u32 = 128
llama_model_loader: - kv 14: llama.vocab_size u32 = 73448
llama_model_loader: - kv 15: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = llama
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,73448] = ["", "~~", "~~", "", "<C...
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,73448] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,73448] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 73440
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 27: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 29: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_load: error loading model: error loading model hyperparameters: invalid n_rot: 128, expected 80
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
llava_init_context: error: failed to init minicpmv model
Segmentation fault (core dumped)
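
The numbers in the error line match the metadata dump: `llama.embedding_length` is 2560 and `llama.attention.head_count` is 32, which gives 80, while `llama.rope.dimension_count` (and `llama.attention.key_length`) is 128. Below is a minimal sketch of that arithmetic in Python. It assumes the loader derives the expected value as embedding_length / head_count; the real check lives in llama.cpp and may differ between versions. The function names `expected_n_rot` and `check_n_rot` are hypothetical and exist only for this illustration.

```python
# Values copied from the llama_model_loader metadata dump in the log above.
metadata = {
    "llama.embedding_length": 2560,
    "llama.attention.head_count": 32,
    "llama.attention.key_length": 128,
    "llama.rope.dimension_count": 128,
}


def expected_n_rot(meta):
    """Head size implied by embedding_length / head_count (assumed loader rule)."""
    return meta["llama.embedding_length"] // meta["llama.attention.head_count"]


def check_n_rot(meta):
    """Return a message shaped like the one in the log, or None if consistent."""
    n_rot = meta["llama.rope.dimension_count"]
    expected = expected_n_rot(meta)
    if n_rot != expected:
        return f"invalid n_rot: {n_rot}, expected {expected}"
    return None


print(check_n_rot(metadata))  # invalid n_rot: 128, expected 80
```

The file stores an explicit key length of 128, so the head size is not simply embedding_length / head_count for this model. One possible reading is that the build used here does not take the explicit key length into account when validating the rope dimension; that is an untested guess, not a confirmed diagnosis.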
