	# Deploy MiniCPM5-1B with llama.cpp

	`llama.cpp` is the recommended path for CPU / edge / consumer-GPU deployment. The released GGUF builds run on laptops, single-board computers, Apple Silicon, and Windows boxes with no Python at all.

	## Released GGUF artifacts

	| File | Size | Use case |
	| --- | --- | --- |
	| `MiniCPM5-1B-F16.gguf` | 2.1 GB | reference quality, uniform CPU/GPU performance |
	| `MiniCPM5-1B-Q8_0.gguf` | 1.1 GB | very small quality drop vs F16, half the disk |
	| `MiniCPM5-1B-Q4_K_M.gguf` | 657 MB | edge / mobile-class hardware, minimal VRAM |

	These artifacts work directly with vanilla `llama.cpp` and every `llama.cpp`-based runtime (Ollama / LM Studio / `llama-cpp-python`).

	## TL;DR — run a release GGUF

	```bash
	huggingface-cli download openbmb/MiniCPM5-1B-GGUF MiniCPM5-1B-Q4_K_M.gguf --local-dir ./minicpm5

	# Interactive chat (auto-applies the chat template)
	llama-cli -m ./minicpm5/MiniCPM5-1B-Q4_K_M.gguf -n 2048 --temp 0.7 --top-p 0.95 -ngl 99
	```
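An interrupted download can leave a truncated file or a saved error page that `llama-cli` then refuses to load. As a minimal sketch, the check below compares the first four bytes of a file with the ASCII magic `GGUF` that GGUF files start with. The function name `is_gguf` is hypothetical and not part of llama.cpp, and the check only looks at the header, not at the tensor data.

```bash
# is_gguf FILE: succeed if FILE starts with the 4-byte GGUF magic.
# Hypothetical helper, not part of llama.cpp.
is_gguf() {
    [ -f "$1" ] || return 1
    # Read the first four bytes; drop NUL bytes so the shell never sees them.
    magic=$(head -c 4 "$1" | tr -d '\000')
    [ "$magic" = "GGUF" ]
}

# Example, using the path from the download command above:
# is_gguf ./minicpm5/MiniCPM5-1B-Q4_K_M.gguf && echo "header looks fine"
```

A passing check does not prove the file is complete, so comparing the size on disk with the size listed in the table above is still worthwhile.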

	## OpenAI-compatible server

	```bash
	# Terminal 1: start the server (it runs in the foreground)
	llama-server -m ./minicpm5/MiniCPM5-1B-Q4_K_M.gguf --port 8080 -ngl 99 -c 8192 --jinja

	# Terminal 2: send a chat completion request
	curl http://localhost:8080/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "MiniCPM5-1B",
	    "messages": [{"role": "user", "content": "1+1=?"}],
	    "temperature": 0.7, "top_p": 0.95, "max_tokens": 256
	  }'
	```
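When the request is sent from a script, the JSON body usually has to be built from a variable rather than typed inline. The sketch below shows one way to do that in plain shell. The function name `chat_payload` is hypothetical, the field values mirror the request above, and the escaping covers only backslashes and double quotes, so it assumes a single-line prompt without control characters.

```bash
# chat_payload PROMPT [MAX_TOKENS]: print a /v1/chat/completions request body.
# Hypothetical helper; escapes only backslashes and double quotes.
chat_payload() {
    # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
    prompt=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"model": "MiniCPM5-1B", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.7, "top_p": 0.95, "max_tokens": %d}\n' \
        "$prompt" "${2:-256}"
}

# Example, with the server from above listening on port 8080:
# curl http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload '1+1=?')"
```

For prompts that contain newlines or other control characters, a real JSON encoder is the safer choice.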

	## Generation parameters

	| Mode | `--temp` | `--top-p` | When to use |
	| --- | --- | --- | --- |
	| Think | 0.9 | 0.95 | reasoning, math, code, multi-step |
	| No-think | 0.7 | 0.95 | fast assistant, latency-bound |
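
To keep scripts in line with the table, the two presets can be wrapped in a small function. This is a minimal sketch: the function name `sampling_flags` and the mode strings `think` and `no-think` are hypothetical. It only selects sampling values; how thinking is switched on or off in the chat template is not covered here.

```bash
# sampling_flags MODE: print the sampling flags for MODE from the table above.
# Hypothetical helper; MODE is "think" or "no-think".
sampling_flags() {
    case "$1" in
        think)    echo "--temp 0.9 --top-p 0.95" ;;
        no-think) echo "--temp 0.7 --top-p 0.95" ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Example: the substitution is left unquoted on purpose so the flags split into words.
# llama-cli -m ./minicpm5/MiniCPM5-1B-Q4_K_M.gguf -n 2048 -ngl 99 $(sampling_flags think)
```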

	## Build a GGUF from your own checkpoint

	If you've trained your own MiniCPM5-1B variant (continued pretraining, domain SFT, …) and want to publish a GGUF, build `llama.cpp`, convert the checkpoint to an F16 GGUF, then quantize it:

	```bash
	git clone --depth=1 https://github.com/ggerganov/llama.cpp.git
	cd llama.cpp
	mkdir -p build && cd build

	# CPU-only build (sufficient for quantize + sanity check)
	cmake .. -DGGML_CUDA=OFF -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
	cmake --build . --config Release -j $(nproc) --target llama-quantize llama-cli llama-server

	# Or a CUDA build for high-throughput inference
	# cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 -DCMAKE_BUILD_TYPE=Release
	# (set CMAKE_CUDA_ARCHITECTURES to your GPU compute capability, see NVIDIA docs)

	cd ..
	SRC=/path/to/your-MiniCPM5-fp16-hf
	OUT=/path/to/output

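	# Run from the llama.cpp repository root cloned above.
	# The converter needs its Python dependencies installed first.
	pip install -r requirements.txt
	mkdir -p "$OUT"
	python ./convert_hf_to_gguf.py "$SRC" --outfile "$OUT/F16.gguf" --outtype f16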
	build/bin/llama-quantize "$OUT/F16.gguf" "$OUT/Q4_K_M.gguf" Q4_K_M
	build/bin/llama-quantize "$OUT/F16.gguf" "$OUT/Q8_0.gguf" Q8_0
	```
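
A quick way to sanity-check the quantized outputs is to compare their sizes with the F16 file. In the released artifacts listed earlier, Q8_0 is roughly half the F16 size and Q4_K_M roughly a third, so a checkpoint with the same architecture would be expected to land near those ratios (an assumption, not a guarantee). The sketch below prints each file's size as a percentage of `F16.gguf`. The function name `size_report` is hypothetical and it assumes the output file names used in the commands above.

```bash
# size_report DIR: print each .gguf in DIR with its size relative to DIR/F16.gguf.
# Hypothetical helper; assumes the file names from the conversion commands above.
size_report() {
    [ -f "$1/F16.gguf" ] || { echo "missing $1/F16.gguf" >&2; return 1; }
    # Wrap wc in arithmetic expansion to strip any padding it prints.
    ref=$(( $(wc -c < "$1/F16.gguf") ))
    [ "$ref" -gt 0 ] || return 1
    for f in "$1"/*.gguf; do
        bytes=$(( $(wc -c < "$f") ))
        echo "$(basename "$f"): $(( bytes * 100 / ref ))% of F16"
    done
}

# Example:
# size_report "$OUT"
```

A file far outside the expected ratio usually points to an interrupted conversion or a wrong quantization type on the command line.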

	## See also

	- [`ollama.md`](./ollama.md) — `ollama run` directly from these GGUFs
	- [`lmstudio.md`](./lmstudio.md) — desktop GUI for the same GGUFs

	---

	_Source: https://github.com/OpenBMB/MiniCPM/blob/main/docs/deployment/llama_cpp.md (fetched 2026-06-15 for reference)._