GLM-5.2: Q3_K / Q8_0 GGUF (CPU-expert quant for older hardware)
GGUF quantization of zai-org/GLM-5.2 (745B total / 40B active, glm-dsa MoE with MLA attention) built for one specific job: fast CPU-expert MoE inference on older / "e-waste" hardware, meaning a dual-socket Xeon (or similar) with lots of RAM and a single modest GPU.
The routed experts are quantized to Q3_K (a plain K-quant) instead of the codebook IQ-quants used by the popular "dynamic" packs, because K-quant dequant is much faster on pre-AVX-512 CPUs, and on a CPU-expert setup that dequant is the decode bottleneck. Same size, equal-or-better quality, faster decode. See below.
Quick start
Routed experts in system RAM, attention + KV on one GPU. ngram speculative decoding is on by default; it's a large win on repetitive / code / structured output and harmless on prose (see Speculative decoding):
GGML_CUDA_NO_PINNED=1 numactl --interleave=all \
./llama-server \
--model GLM-5.2-Q3_K-Q8_0-00001-of-0000N.gguf \
-ngl 999 -ot 'ffn_.*_exps=CPU' \
-fa on -ctk q8_0 -ctv q8_0 -c 16384 \
-t 42 \
--spec-type ngram-cache \
--jinja
- -ot 'ffn_.*_exps=CPU' keeps the 256 routed experts in RAM while attention, KV, the shared expert and the router stay on the GPU (~18 GB VRAM). The shared expert fires every token, so it belongs in HBM, not on the CPU path. Equivalents: --cpu-moe (same placement, one flag); --n-cpu-moe N (keep only the first N layers' experts on CPU, offload the rest if you have spare VRAM).
- -t 42 ≈ physical cores + some hyperthreads on a 2×14c box; tune to your core count.
- GGML_CUDA_NO_PINNED=1 avoids pinning ~300 GB of host memory; numactl --interleave=all helps on dual-socket boxes.
- All-CPU (no GPU) also works: drop the GPU flags and use -ngl 0.
- Runs on llama.cpp and ik_llama.cpp (for ik_llama.cpp, swap the speculator: --spec-type suffix).
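As a rough starting point for the -t value, the "physical cores + some hyperthreads" rule can be sketched as below. The 0.5 fraction is an assumption inferred from -t 42 on a 2×14-core box with hyperthreading (28 physical, 56 logical); `suggest_threads` is a hypothetical helper, not part of llama.cpp, and the result should still be tuned by measurement.

```python
def suggest_threads(physical_cores: int, logical_cpus: int, ht_fraction: float = 0.5) -> int:
    """Return physical cores plus a fraction of the extra hyperthreads.

    ht_fraction=0.5 is an assumed heuristic, chosen so that a 2 x 14-core
    box with hyperthreading (28 physical, 56 logical) yields 42.
    """
    if physical_cores < 1 or logical_cpus < physical_cores:
        raise ValueError("need physical_cores >= 1 and logical_cpus >= physical_cores")
    extra_hyperthreads = logical_cpus - physical_cores
    return physical_cores + int(extra_hyperthreads * ht_fraction)


# Reference box from this card: 2 sockets x 14 cores, hyperthreading on.
print(suggest_threads(physical_cores=28, logical_cpus=56))
```

Treat the output as the first value to benchmark, then step it up and down by a few threads.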
Do not add --run-time-repack / -rtr. It allocates a second full copy of the model in RAM, which page-thrashes (and can OOM) on the RAM-constrained boxes this quant targets, for no decode benefit at batch=1.
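Once the server from the quick start is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on llama-server's default address, http://localhost:8080; `build_chat_request` and `chat` are illustrative names, and the long timeout is a guess suited to single-digit tok/s decode.

```python
import json
import urllib.request

# Assumption: llama-server was started without --host/--port overrides.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("What is the capital of France?"))
```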
Speculative decoding
This is a 256-expert / 8-active MoE with the experts on the CPU, which makes the kind of speculation matter a lot.
ngram (recommended, on by default)
--spec-type ngram-cache (llama.cpp) / --spec-type suffix (ik_llama.cpp) drafts tokens from the recent context: no draft model, no extra weights. On repeated spans the drafted tokens route to the same experts already in the verify batch, so the MoE verify cost doesn't blow up. It's workload-gated: it fires only when the output actually repeats.
| output type | speedup vs no-spec |
|---|---|
| verbatim / highly repetitive | +80 % |
| CSV / structured records | +52 % |
| boilerplate / templated code | +37 % |
| general prose | ~0 % (never fires) |
| novel, non-repetitive code | −5 % |
On the reference box that's roughly 3.5 → 5-7 tok/s on agentic / code-echo / templated workloads. It costs ~5 % on novel code (it drafts on partial matches that miss), so if your traffic is purely free-form prose/code you can drop --spec-type ngram-cache; for agentic, tool-use, refactoring, structured-output, or any repetitive workload, leave it on.
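To make the "drafts tokens from the recent context" idea concrete, here is a toy sketch of n-gram lookup drafting. It is not the llama.cpp or ik_llama.cpp implementation; `ngram_draft` and its parameters are hypothetical, and real speculators add match-quality heuristics this omits. It shows why the method fires only on repeating output: with no earlier copy of the trailing n-gram, nothing is drafted.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last n tokens. Returns [] if there is none."""
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards over earlier start positions; the range stops one short
    # of the trailing n-gram itself so it can never match its own position.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            return list(tokens[start + n:start + n + max_draft])
    return []


repetitive = [1, 2, 3, 4, 5, 1, 2, 3]   # the span "1 2 3" has been seen before
novel = [1, 2, 3, 4, 5, 6]              # nothing repeats

print(ngram_draft(repetitive, n=3, max_draft=2))
print(ngram_draft(novel))
```

The drafted tokens are then verified by the model in one batch; only the accepted prefix is kept.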
MTP: DO NOT use on this model
The GGUF retains the nextn / MTP head (blk.N.nextn.*) and both engines support --spec-type draft-mtp, but it is a hard loss here: ~−50 % at every draft depth, even at 100 % draft acceptance. The batched verify of N drafted tokens reads the union of their routed experts (8 active × N mostly-disjoint sets), so each speculated token ~doubles the per-step expert traffic; acceptance can't pay that back. Don't enable MTP for this MoE.
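The "union of routed experts" argument can be checked with a back-of-the-envelope estimate. The sketch below assumes each token in the verify batch picks its 8 experts uniformly and independently of the other tokens, which is a simplification (real routing is learned and per-layer); `expected_distinct_experts` is an illustrative helper.

```python
def expected_distinct_experts(n_tokens: int, n_experts: int = 256, n_active: int = 8) -> float:
    """Expected number of distinct experts touched per layer by a batch of
    n_tokens, assuming uniform, independent routing (an assumption)."""
    # Probability that one specific expert is picked by none of the tokens.
    p_untouched = (1.0 - n_active / n_experts) ** n_tokens
    return n_experts * (1.0 - p_untouched)


for n in (1, 2, 4, 8):
    e = expected_distinct_experts(n)
    print(f"{n} tokens in the verify batch -> ~{e:.1f} distinct experts "
          f"({e / 8:.2f}x a single token)")
```

Under this assumption two tokens already touch close to twice the expert weights of one, which is the traffic growth described above; ngram drafts avoid it only because repeated spans tend to reuse the same experts.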
Why Q3_K experts
The popular "dynamic" Q3 packs (e.g. UD-Q3_K_XL) quantize the routed experts with IQ3_XXS / IQ4_XS, codebook quants that are excellent for size-per-quality, but whose dequant relies on a 256-entry grid gather. On pre-AVX-512 CPUs (Haswell/Broadwell and older) there is no fast gather (it's emulated), so on a CPU-expert setup that codebook dequant becomes the decode bottleneck.
This quant uses Q3_K for the experts (shift/mask + a block scale; no codebook, no gather) and Q8_0 for everything else. At the same size it's measurably faster to decode with equal-or-better quality, and the gap widens on weaker/older CPUs.
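The difference between the two dequant styles can be illustrated with a toy example. This is deliberately not the real Q3_K or IQ3_XXS block layout (those use super-blocks, sub-block scales and packed high bits); the packing, the tiny grid and both function names are assumptions made for illustration only.

```python
from typing import List


def dequant_shift_mask(packed: int, count: int, scale: float, bits: int = 3) -> List[float]:
    """K-quant style: extract fixed-width fields with shift and mask,
    re-centre them, and multiply by one block scale. Pure arithmetic."""
    mask = (1 << bits) - 1
    offset = 1 << (bits - 1)  # 4 for 3-bit fields, giving the range -4..3
    return [scale * (((packed >> (bits * i)) & mask) - offset) for i in range(count)]


def dequant_codebook(indices: List[int], grid: List[List[float]], scale: float) -> List[float]:
    """Codebook style: each stored index selects a whole group of values
    from a lookup table, so decoding is a data-dependent memory gather."""
    out: List[float] = []
    for idx in indices:
        out.extend(scale * v for v in grid[idx])
    return out


# Three 3-bit fields (0, 7, 4) packed little-end first into one integer.
packed = 0 | (7 << 3) | (4 << 6)
print(dequant_shift_mask(packed, count=3, scale=0.5))

# A two-entry toy grid; the real grids referred to above have 256 entries.
toy_grid = [[1.0, -1.0], [3.0, 3.0]]
print(dequant_codebook([1, 0], toy_grid, scale=2.0))
```

The first function vectorises with plain shifts and multiplies; the second needs an indexed load per group, which is the operation older CPUs lack a fast instruction for.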
Composition
| tensors | type |
|---|---|
| routed experts (ffn_*_exps) | Q3_K (~3.44 bpw) |
| MLA attention, shared expert, dense FFN, token/output embeddings, norms | Q8_0 |
- imatrix: computed on wikitext-2 (200 × 512-token chunks), applied to the experts.
- Size: ~291 GiB. Needs ~300 GB RAM for the experts (plus a GPU or more RAM for attention/KV).
Quality: perplexity (matches the IQ3 pack, edge to this one)
wikitext-2 test, ctx 512, 100 chunks, identical all-CPU config:
| quant | experts | PPL ↓ |
|---|---|---|
| unsloth/UD-Q3_K_XL | IQ3_XXS / IQ4_XS | 2.8784 ± 0.036 |
| this quant, RTN (no imatrix) | Q3_K | 2.8798 ± 0.036 |
| this quant, imatrix | Q3_K + imatrix | 2.8265 ± 0.035 |
The imatrix Q3_K/Q8_0 quant matches and slightly beats the IQ3 dynamic pack at the same size, and the imatrix is a clean improvement over RTN (2.880 → 2.827).
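To put "slightly beats" in numbers, the gap can be compared against the reported uncertainties. The sketch below treats the ± values as independent standard errors, which is an assumption: both runs score the same text, so their errors are correlated and a paired comparison could give a different picture. `compare_ppl` is an illustrative helper.

```python
import math
from typing import Tuple


def compare_ppl(a: float, a_err: float, b: float, b_err: float) -> Tuple[float, float, float]:
    """Return (absolute gap, gap relative to a, gap in combined standard errors).

    Assumes a_err and b_err are independent standard errors (an assumption)."""
    delta = a - b
    combined_err = math.hypot(a_err, b_err)
    return delta, delta / a, delta / combined_err


# Values from the perplexity table: UD-Q3_K_XL vs this quant with imatrix.
delta, rel, sigmas = compare_ppl(2.8784, 0.036, 2.8265, 0.035)
print(f"PPL gap = {delta:.4f} ({rel:.1%} lower), "
      f"about {sigmas:.1f} combined standard errors")
```

Read this way, the imatrix quant is at least on par with the IQ3 pack, with a small edge that sits near the edge of the quoted error bars.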
Decode speed (the +5 %)
Reference box: 2× Xeon E5-2690 v4 (14c each, AVX2, no AVX-512), 1× MI100 holding attention + KV, all 256 experts in system RAM:
| quant | decode tok/s ↑ |
|---|---|
| unsloth/UD-Q3_K_XL (IQ experts) | 3.35 |
| this quant (Q3_K experts) | 3.53 (+5 %) |
The +5 % is purely the cheaper expert dequant (K-quant shift/mask vs IQ codebook gather); it grows on older CPUs with weaker gather. Decode is CPU-dequant-bound, so it scales with memory bandwidth and core count; AVX-512 / more memory channels go faster. (And --spec-type ngram-cache adds the speculation uplift on top.)
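The arithmetic behind these figures can be reproduced from the two tables on this card. The sketch assumes the ngram speedups apply multiplicatively to the 3.53 tok/s baseline of this quant, which matches the "on top" wording but is not separately measured here.

```python
BASELINE_TOK_S = 3.35  # unsloth/UD-Q3_K_XL (IQ experts), from the decode table
Q3K_TOK_S = 3.53       # this quant (Q3_K experts), from the decode table

# Speedups vs no-spec, from the ngram table in the speculative decoding section.
NGRAM_UPLIFT = {
    "verbatim / highly repetitive": 0.80,
    "CSV / structured records": 0.52,
    "boilerplate / templated code": 0.37,
    "general prose": 0.00,
    "novel, non-repetitive code": -0.05,
}


def with_uplift(tok_s: float, uplift: float) -> float:
    """Apply a fractional speedup (0.37 means +37 %) to a decode rate."""
    return tok_s * (1.0 + uplift)


print(f"Q3_K vs IQ experts: {Q3K_TOK_S / BASELINE_TOK_S - 1:+.1%}")
for workload, uplift in NGRAM_UPLIFT.items():
    print(f"{workload:32s} {with_uplift(Q3K_TOK_S, uplift):.2f} tok/s")
```

The repetitive workloads land in the same range as the roughly 5-7 tok/s quoted in the speculative decoding section.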
Build provenance
zai-org/GLM-5.2 (BF16 safetensors) → convert_hf_to_gguf.py (glm-dsa) → Q8_0 GGUF → llama-quantize --allow-requantize --imatrix <wikitext.imatrix> --custom-q 'ffn_.*_exps=q3_K' <q8_0> <out> Q8_0.
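For anyone scripting the same pipeline, the two steps can be assembled as argument lists. This is a sketch, not the exact invocation used for this upload: the quantize flags are copied from the line above, --outfile and --outtype are the converter's standard options, and every path is a hypothetical placeholder. The commands are only printed, not executed.

```python
import shlex
from typing import List


def convert_cmd(hf_dir: str, q8_gguf: str) -> List[str]:
    """Step 1: BF16 safetensors -> Q8_0 GGUF via convert_hf_to_gguf.py."""
    return ["python", "convert_hf_to_gguf.py", hf_dir,
            "--outfile", q8_gguf, "--outtype", "q8_0"]


def quantize_cmd(imatrix: str, q8_gguf: str, out_gguf: str) -> List[str]:
    """Step 2: requantize only the routed experts to Q3_K, keep the rest Q8_0."""
    return ["llama-quantize", "--allow-requantize",
            "--imatrix", imatrix,
            "--custom-q", "ffn_.*_exps=q3_K",
            q8_gguf, out_gguf, "Q8_0"]


# Hypothetical paths, for illustration only.
steps = [
    convert_cmd("./GLM-5.2", "./glm-5.2-q8_0.gguf"),
    quantize_cmd("./wikitext.imatrix", "./glm-5.2-q8_0.gguf", "./glm-5.2-q3k-q8_0.gguf"),
]
for argv in steps:
    print(shlex.join(argv))
```

Passing the lists to subprocess.run(argv, check=True) would execute them; shlex.join is used here so the regex is quoted correctly when copied into a shell.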
ik_llama.cpp loader note: some builds require the DSA-indexer tensors (blk.N.indexer.*), but mainline's converter writes them as optional and only on a subset of layers. They are loaded-but-unused in inference, so if loading fails with check_tensor_dims: ...indexer.k_norm.weight not found, mark those create_tensor calls TENSOR_NOT_REQUIRED.
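Before patching a loader, it can help to see which layers actually carry indexer tensors. The sketch below groups tensor names by layer; `layers_with_indexer` is an illustrative helper that works on any list of names, and `tensor_names_from_gguf` assumes the gguf Python package (pip install gguf) and its GGUFReader class are available.

```python
import re
from typing import Dict, Iterable, List

_BLK = re.compile(r"^blk\.(\d+)\.(.+)$")


def layers_with_indexer(tensor_names: Iterable[str]) -> Dict[str, List[int]]:
    """Split layer indices by whether any blk.N.indexer.* tensor is present."""
    with_indexer, all_layers = set(), set()
    for name in tensor_names:
        match = _BLK.match(name)
        if match is None:
            continue  # not a per-layer tensor (e.g. token_embd.weight)
        layer = int(match.group(1))
        all_layers.add(layer)
        if match.group(2).startswith("indexer."):
            with_indexer.add(layer)
    return {
        "with_indexer": sorted(with_indexer),
        "without_indexer": sorted(all_layers - with_indexer),
    }


def tensor_names_from_gguf(path: str) -> List[str]:
    """Read tensor names from one GGUF shard (requires the gguf package)."""
    from gguf import GGUFReader  # imported lazily so the helper above has no dependency
    return [tensor.name for tensor in GGUFReader(path).tensors]


# Usage on a local shard (path is a placeholder):
#   print(layers_with_indexer(tensor_names_from_gguf("model-shard.gguf")))
```

Layers listed under without_indexer are the ones a strict loader would reject with the error quoted above.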