Spaces:
Running
Running
| # Deploy MiniCPM5-1B with llama.cpp | |
| `llama.cpp` is the recommended path for **CPU / edge / consumer-GPU** deployment. The released GGUF builds run on laptops, single-board computers, Apple Silicon, and Windows boxes with no Python at all. | |
| ## Released GGUF artifacts | |
| | File | Size | Use case | | |
| | --- | --- | --- | | |
| | `MiniCPM5-1B-F16.gguf` | 2.1 GB | reference quality, uniform CPU/GPU performance | | |
| | `MiniCPM5-1B-Q8_0.gguf` | 1.1 GB | very small quality drop vs F16, half the disk | | |
| | `MiniCPM5-1B-Q4_K_M.gguf` | 657 MB | edge / mobile-class hardware, minimal VRAM | | |
| These artifacts work directly with vanilla `llama.cpp` and every `llama.cpp`-based runtime (Ollama / LM Studio / `llama-cpp-python`). | |
| ## TL;DR — run a release GGUF | |
| ```bash | |
| huggingface-cli download openbmb/MiniCPM5-1B-GGUF MiniCPM5-1B-Q4_K_M.gguf --local-dir ./minicpm5 | |
| # Interactive chat (auto-applies the chat template) | |
| llama-cli -m ./minicpm5/MiniCPM5-1B-Q4_K_M.gguf -n 2048 --temp 0.7 --top-p 0.95 -ngl 99 | |
| ``` | |
| ## OpenAI-compatible server | |
| ```bash | |
| llama-server -m MiniCPM5-1B-Q4_K_M.gguf --port 8080 -ngl 99 -c 8192 --jinja | |
| curl http://localhost:8080/v1/chat/completions \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "model": "MiniCPM5-1B", | |
| "messages": [{"role": "user", "content": "1+1=?"}], | |
| "temperature": 0.7, "top_p": 0.95, "max_tokens": 256 | |
| }' | |
| ``` | |
| ## Generation parameters | |
| | Mode | `--temp` | `--top-p` | When to use | | |
| | --- | --- | --- | --- | | |
| | Think | 0.9 | 0.95 | reasoning, math, code, multi-step | | |
| | No-think | 0.7 | 0.95 | fast assistant, latency-bound | | |
| ## Build a GGUF from your own checkpoint | |
| If you've trained your own MiniCPM5-1B variant (continue-pretraining, domain SFT, …) and want to publish a GGUF, the pipeline is: | |
| ```bash | |
| git clone --depth=1 https://github.com/ggerganov/llama.cpp.git | |
| cd llama.cpp | |
| mkdir -p build && cd build | |
| # CPU-only build (sufficient for quantize + sanity check) | |
| cmake .. -DGGML_CUDA=OFF -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release | |
| cmake --build . --config Release -j $(nproc) --target llama-quantize llama-cli llama-server | |
| # Or a CUDA build for high-throughput inference | |
| # cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DCMAKE_BUILD_TYPE=Release | |
| # (set CMAKE_CUDA_ARCHITECTURES to your GPU compute capability, see NVIDIA docs) | |
| cd .. | |
| SRC=/path/to/your-MiniCPM5-fp16-hf | |
| OUT=/path/to/output | |
| # Run from the llama.cpp repository root cloned above. | |
| python ./convert_hf_to_gguf.py "$SRC" --outfile "$OUT/F16.gguf" --outtype f16 | |
| build/bin/llama-quantize "$OUT/F16.gguf" "$OUT/Q4_K_M.gguf" Q4_K_M | |
| build/bin/llama-quantize "$OUT/F16.gguf" "$OUT/Q8_0.gguf" Q8_0 | |
| ``` | |
| ## See also | |
| - [`ollama.md`](./ollama.md) — `ollama run` directly from these GGUFs | |
| - [`lmstudio.md`](./lmstudio.md) — desktop GUI for the same GGUFs | |
| --- | |
| _Source: https://github.com/OpenBMB/MiniCPM/blob/main/docs/deployment/llama_cpp.md (fetched 2026-06-15 for reference)._ | |