error loading model hyperparameters: invalid n_rot: 128, expected 80
./llama-minicpmv-cli -m ../ggml-model-Q4_0.gguf --mmproj ../mmproj-model-f16.gguf -c 4096 -ngl 100 --temp 0.1 --top-p 0.8 --top-k 100 --repeat-penalty 1.1 --image /work/zhouqingsong/bigmodel/image_408770077684281.png -p "What is the current state of the traffic light in the picture?"
Log start
clip_model_load: description: image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 455
clip_model_load: n_kv: 19
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ../mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_minicpmv_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for MiniCPM-V
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.minicpmv_version i32 = 3
clip_model_load: - kv 8: clip.vision.image_size u32 = 448
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 27
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 18: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 170 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 1
clip_model_load: model size: 996.02 MB
clip_model_load: metadata size: 0.16 MB
clip_model_load: params backend buffer size = 996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
uhd_slice_image: multiple 6
uhd_slice_image: image_size: 1365 768; source_image size: 602 336
uhd_slice_image: image_size: 1365 768; best_grid: 3 2
uhd_slice_image: refine_image_size: 1470 812; refine_size: 1470 812
clip_image_preprocess: 602 336
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_preprocess: 490 406
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 7 encoded in 963.86 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 2 of 7 encoded in 638.17 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 3 of 7 encoded in 629.01 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 4 of 7 encoded in 590.18 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 5 of 7 encoded in 594.84 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 6 of 7 encoded in 611.90 ms
clip_image_build_graph: 490 406
encode_image_with_clip: step 7 of 7 encoded in 612.10 ms
encode_image_with_clip: all 7 segments encoded in 4643.62 ms
encode_image_with_clip: load_image_size 1365 768
encode_image_with_clip: image embedding created: 448 tokens
encode_image_with_clip: image encoded in 4647.04 ms by CLIP ( 10.37 ms per image patch)
llama_model_loader: loaded meta data with 32 key-value pairs and 291 tensors from ../ggml-model-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 3.6B
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.context_length u32 = 32768
llama_model_loader: - kv 6: llama.embedding_length u32 = 2560
llama_model_loader: - kv 7: llama.feed_forward_length u32 = 10240
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 2
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: llama.attention.key_length u32 = 128
llama_model_loader: - kv 13: llama.attention.value_length u32 = 128
llama_model_loader: - kv 14: llama.vocab_size u32 = 73448
llama_model_loader: - kv 15: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 16: tokenizer.ggml.model str = llama
llama_model_loader: - kv 17: tokenizer.ggml.pre str = default
llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,73448] = ["", "", "", "", "<C...
llama_model_loader: - kv 19: tokenizer.ggml.scores arr[f32,73448] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,73448] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 73440
llama_model_loader: - kv 23: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 26: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 27: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 29: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_load: error loading model: error loading model hyperparameters: invalid n_rot: 128, expected 80
llama_load_model_from_file: failed to load model
llava_init: error: unable to load model
llava_init_context: error: failed to init minicpmv model
Segmentation fault (core dumped)
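
The numbers in the error line can be checked against the metadata dump above. The sketch below is only an illustration, not llama.cpp code. It assumes the loader in this build derives the expected rotary dimension as embedding_length / head_count, which is consistent with the reported "expected 80" (2560 / 32), while the file stores llama.rope.dimension_count = 128, the same value as its explicit llama.attention.key_length. The helper names parse_int_metadata and derived_head_dim are hypothetical.

```python
import re

# Metadata lines copied from the llama_model_loader output above.
LOG = """\
llama_model_loader: - kv 6: llama.embedding_length u32 = 2560
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 12: llama.attention.key_length u32 = 128
llama_model_loader: - kv 15: llama.rope.dimension_count u32 = 128
"""

# Matches "- kv <index>: <key> <type> = <value>" for integer-valued keys.
KV_LINE = re.compile(r"- kv\s+\d+:\s+(\S+)\s+\S+\s+=\s+(\d+)\s*$")


def parse_int_metadata(log_text):
    """Return {key: int_value} for every integer kv line in the log text."""
    values = {}
    for line in log_text.splitlines():
        match = KV_LINE.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def derived_head_dim(meta):
    """Head dimension implied by embedding_length / head_count (assumed check)."""
    return meta["llama.embedding_length"] // meta["llama.attention.head_count"]


meta = parse_int_metadata(LOG)
n_rot = meta["llama.rope.dimension_count"]
expected = derived_head_dim(meta)
key_length = meta["llama.attention.key_length"]

print(f"rope.dimension_count in file:  {n_rot}")
print(f"embedding_length / head_count: {expected}")
print(f"attention.key_length in file:  {key_length}")
print(f"mismatch: {n_rot != expected}")
```

If that assumption holds, the file itself is self-consistent (rope.dimension_count equals key_length) and the failure comes from the binary ignoring the explicit key_length. One thing worth checking is whether a more recent llama.cpp build loads the same file.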