Instructions to use SixVolts/GLM-5.2-ewaste-edition-GGUF with libraries, inference providers, notebooks, and local apps.

Libraries

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with llama-cpp-python:

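# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF from the Hugging Face Hub and load it.
llm = Llama.from_pretrained(
    repo_id="SixVolts/GLM-5.2-ewaste-edition-GGUF",
    filename="GLM-5.2-Q3_K-Q8_0-00001-of-00008.gguf",
)

# Run one chat turn and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])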

Local Apps

llama.cpp

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
# Run inference directly in the terminal:
llama cli -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
# Run inference directly in the terminal:
llama cli -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Use Docker

docker model run hf.co/SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

vLLM

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SixVolts/GLM-5.2-ewaste-edition-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SixVolts/GLM-5.2-ewaste-edition-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Ollama
How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Ollama:
```
ollama run hf.co/SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
```

Unsloth Studio

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SixVolts/GLM-5.2-ewaste-edition-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SixVolts/GLM-5.2-ewaste-edition-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SixVolts/GLM-5.2-ewaste-edition-GGUF to start chatting

Pi

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent

Add to ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Docker Model Runner:
```
docker model run hf.co/SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0
```

Lemonade

How to use SixVolts/GLM-5.2-ewaste-edition-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SixVolts/GLM-5.2-ewaste-edition-GGUF:Q8_0

Run and chat with the model

lemonade run user.GLM-5.2-ewaste-edition-GGUF-Q8_0

List all available models

lemonade list

GLM-5.2 — Q3_K / Q8_0 GGUF (CPU-expert quant for older hardware)

GGUF quantization of zai-org/GLM-5.2 (745B total / 40B active, glm-dsa MoE with MLA attention) built for one specific job: fast CPU-expert MoE inference on older / "e-waste" hardware — a dual-socket Xeon (or similar) with lots of RAM and a single modest GPU.

The routed experts are quantized to Q3_K (a plain K-quant) instead of the codebook IQ-quants used by the popular "dynamic" packs, because K-quant dequant is much faster on pre-AVX-512 CPUs — and on a CPU-expert setup that dequant is the decode bottleneck. Same size, equal-or-better quality, faster decode. See below.

Quick start

Routed experts in system RAM, attention + KV on one GPU. The command below enables ngram speculative decoding: it is a large win on repetitive / code / structured output and harmless on prose (see Speculative decoding):

GGML_CUDA_NO_PINNED=1 numactl --interleave=all \
./llama-server \
  --model GLM-5.2-Q3_K-Q8_0-00001-of-0000N.gguf \
  -ngl 999 -ot 'ffn_.*_exps=CPU' \
  -fa on -ctk q8_0 -ctv q8_0 -c 16384 \
  -t 42 \
  --spec-type ngram-cache \
  --jinja

- `-ot 'ffn_.*_exps=CPU'` keeps the 256 routed experts in RAM while attention, KV, the shared expert and the router stay on the GPU (~18 GB VRAM). The shared expert fires on every token, so it belongs in GPU memory, not on the CPU path. Equivalents: `--cpu-moe` (same placement, one flag); `--n-cpu-moe N` (keep only the first N layers' experts on CPU, offload the rest if you have spare VRAM).
- `-t 42` ≈ physical cores + some hyperthreads on a 2×14c box; tune to your core count.
- `GGML_CUDA_NO_PINNED=1` avoids pinning ~300 GB of host memory; `numactl --interleave=all` helps on dual-socket boxes.
- All-CPU (no GPU) also works: drop the GPU flags and use `-ngl 0`.
- Runs on llama.cpp and ik_llama.cpp (for ik_llama.cpp, swap the speculator: `--spec-type suffix`).
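To see which tensors the `ffn_.*_exps` override pattern captures, it can be tried against tensor names. This is a minimal Python sketch, assuming the usual llama.cpp GGUF naming (`ffn_*_exps` for routed experts, `ffn_*_shexp` for the shared expert, `ffn_gate_inp` for the router) and that the pattern is applied as an unanchored regex search. The sample names and layer index are illustrative, not read from this file, and "GPU" here just means "not overridden, so placed by `-ngl 999`".

```python
import re

# Pattern half of the override "-ot 'ffn_.*_exps=CPU'" from the quick start.
EXPERT_PATTERN = re.compile(r"ffn_.*_exps")

def placement(tensor_name: str) -> str:
    """Return "CPU" if the override pattern matches the name, else "GPU"."""
    return "CPU" if EXPERT_PATTERN.search(tensor_name) else "GPU"

# Illustrative tensor names (assumed naming, arbitrary layer index).
sample_names = [
    "blk.10.ffn_gate_exps.weight",   # routed experts
    "blk.10.ffn_up_exps.weight",
    "blk.10.ffn_down_exps.weight",
    "blk.10.ffn_gate_shexp.weight",  # shared expert
    "blk.10.ffn_gate_inp.weight",    # router
    "blk.10.attn_output.weight",     # attention
]

for name in sample_names:
    print(f"{name:32s} -> {placement(name)}")
```

The shared-expert names end in `_shexp`, not `_exps`, which is why the pattern leaves them on the GPU.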

Do not add --run-time-repack / -rtr. It allocates a second full copy of the model in RAM, which page-thrashes (and can OOM) on the RAM-constrained boxes this quant targets, for no decode benefit at batch=1.

Speculative decoding

This is a 256-expert / 8-active MoE with the experts on the CPU, so the choice of speculation method matters a lot.

ngram (recommended, on by default)

--spec-type ngram-cache (llama.cpp) / --spec-type suffix (ik_llama.cpp) drafts tokens from the recent context — no draft model, no extra weights. On repeated spans the drafted tokens route to the same experts already in the verify batch, so the MoE verify cost doesn't blow up. It's workload-gated — it fires only when the output actually repeats:

| output type | speedup vs no-spec |
| --- | --- |
| verbatim / highly repetitive | +80 % |
| CSV / structured records | +52 % |
| boilerplate / templated code | +37 % |
| general prose | ~0 % (never fires) |
| novel, non-repetitive code | −5 % |

On the reference box that's roughly 3.5 → 5–7 tok/s on agentic / code-echo / templated workloads. It costs ~5 % on novel code (it drafts on partial matches that miss), so if your traffic is purely free-form prose/code you can drop --spec-type ngram-cache; for agentic, tool-use, refactoring, structured-output, or any repetitive workload, leave it on.
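The idea of drafting from recent context can be sketched in a few lines of Python. This is a simplified illustration of n-gram lookup drafting, not the llama.cpp or ik_llama.cpp implementation; the function name `ngram_draft` and its parameters are hypothetical.

```python
from typing import List, Sequence

def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the last n tokens against earlier context."""
    if len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    # Scan backwards for the most recent earlier occurrence of the key.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == key:
            follow = start + n
            # Draft whatever followed the key last time it appeared.
            return list(tokens[follow:follow + max_draft])
    # No earlier occurrence: nothing is drafted, so nothing extra is verified.
    return []

# Repetitive context: the tail (1, 2, 3) was seen before, followed by 4, 5.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3, max_draft=2))
# Non-repetitive context: no match, so the speculator stays idle.
print(ngram_draft([1, 2, 3, 4, 5, 6, 7]))
```

When the context has no repeated span the function returns an empty draft, which mirrors the "never fires" row for general prose.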

MTP — DO NOT use on this model

The GGUF retains the nextn / MTP head (blk.N.nextn.*) and both engines support --spec-type draft-mtp, but it is a hard loss here: ~−50 % at every draft depth, even at 100 % draft acceptance. The batched verify of N drafted tokens reads the union of their routed experts (8 active × N mostly-disjoint sets), so each drafted token adds roughly one more token's worth of expert reads to the step, and a single drafted token already ~doubles the per-step expert traffic. Acceptance can't pay that back. Don't enable MTP for this MoE.
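The expert-union argument can be checked with a little arithmetic. This is a rough Python sketch that assumes each token routes to 8 of 256 experts uniformly and independently per layer; real routing is correlated, so treat the numbers as an illustration of the trend, not a measurement.

```python
def expected_distinct_experts(batch_tokens: int, n_experts: int = 256, n_active: int = 8) -> float:
    """Expected number of distinct experts a batch touches in one MoE layer,
    under uniform, independent routing (a simplifying assumption)."""
    # Probability that a given expert is picked by none of the tokens in the batch.
    p_untouched = (1.0 - n_active / n_experts) ** batch_tokens
    return n_experts * (1.0 - p_untouched)

for batch in (1, 2, 3, 4):
    distinct = expected_distinct_experts(batch)
    print(f"{batch} token(s): ~{distinct:.1f} experts, {distinct / 8:.2f}x a single-token step")
```

With so few experts active per token, overlap between tokens is rare, so the expert reads per verify step grow almost linearly with the number of tokens in the batch.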

Why Q3_K experts

The popular "dynamic" Q3 packs (e.g. UD-Q3_K_XL) quantize the routed experts with IQ3_XXS / IQ4_XS: codebook quants that give excellent quality for their size, but whose dequant relies on a 256-entry grid gather. Pre-AVX-512 CPUs (Haswell/Broadwell and older) have no fast gather (it's emulated), so on a CPU-expert setup that codebook dequant becomes the decode bottleneck.

This quant uses Q3_K for the experts (shift/mask + a block scale — no codebook, no gather) and Q8_0 for everything else. At the same size it's measurably faster to decode with equal-or-better quality, and the gap widens on weaker/older CPUs.
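The difference between the two dequant styles can be shown with a toy example. This Python sketch is deliberately simplified and does not reproduce ggml's actual block layouts; the function names and the tiny grid are hypothetical, chosen only to contrast arithmetic-only unpacking with a data-dependent table lookup.

```python
from typing import List, Sequence

def toy_kquant_dequant(low2: Sequence[int], high1: Sequence[int], scale: float) -> List[float]:
    """Shift/mask style: rebuild a 3-bit code from 2 low bits + 1 high bit, then scale."""
    out = []
    for lo, hi in zip(low2, high1):
        q = (lo & 0b11) | ((hi & 0b1) << 2)  # 3-bit code in 0..7
        out.append(scale * (q - 4))           # recentre to -4..3, apply block scale
    return out

def toy_codebook_dequant(indices: Sequence[int], grid: Sequence[Sequence[int]], scale: float) -> List[float]:
    """Codebook style: each index selects a whole group of values from a lookup grid."""
    out = []
    for idx in indices:
        # Data-dependent table read; in SIMD code this is the gather step.
        out.extend(scale * v for v in grid[idx])
    return out

print(toy_kquant_dequant([3, 0], [1, 0], scale=0.5))
print(toy_codebook_dequant([1, 0], grid=[[1, 2], [3, 4]], scale=2.0))
```

The first function needs only bit operations and a multiply per weight, while the second depends on an indexed memory read for every group.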

Composition

| tensors | type |
| --- | --- |
| routed experts (`ffn_*_exps`) | Q3_K (~3.44 bpw) |
| MLA attention, shared expert, dense FFN, token/output embeddings, norms | Q8_0 |

imatrix: computed on wikitext-2 (200 × 512-token chunks), applied to the experts.
Size: ~291 GiB. Needs ~300 GB RAM for the experts (plus a GPU or more RAM for attention/KV).
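The bits-per-weight figures follow from the block sizes of the two formats. This is a small Python sketch, assuming the standard ggml block layouts noted in the comments; the `size_gib` helper is illustrative, and the model card does not state the exact expert / non-expert parameter split needed to reproduce the ~291 GiB total.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Storage cost of a block-quantized format in bits per weight."""
    return block_bytes * 8 / weights_per_block

# Q3_K super-block of 256 weights: 32 B high-bit mask + 64 B low 2-bit values
# + 12 B packed sub-block scales + 2 B fp16 super-block scale = 110 B.
Q3_K_BPW = bits_per_weight(32 + 64 + 12 + 2, 256)

# Q8_0 block of 32 weights: 32 B int8 values + 2 B fp16 scale = 34 B.
Q8_0_BPW = bits_per_weight(32 + 2, 32)

def size_gib(n_params: float, bpw: float) -> float:
    """Size in GiB of n_params weights stored at bpw bits per weight."""
    return n_params * bpw / 8 / 2**30

print(f"Q3_K: {Q3_K_BPW} bpw, Q8_0: {Q8_0_BPW} bpw")
```

Because the routed experts hold most of the parameters, the overall size is dominated by the Q3_K rate even though everything else is stored at Q8_0.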

Quality — perplexity (matches the IQ3 pack, edge to this one)

wikitext-2 test, ctx 512, 100 chunks, identical all-CPU config:

| quant | experts | PPL ↓ |
| --- | --- | --- |
| `unsloth/UD-Q3_K_XL` | IQ3_XXS / IQ4_XS | 2.8784 ± 0.036 |
| this quant, RTN (no imatrix) | Q3_K | 2.8798 ± 0.036 |
| this quant, imatrix | Q3_K + imatrix | 2.8265 ± 0.035 |

The imatrix Q3_K/Q8_0 quant matches and slightly beats the IQ3 dynamic pack at the same size, and the imatrix is a clean improvement over RTN (2.880 → 2.827).
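The relative size of these perplexity gaps is quick to compute. This is a minimal Python sketch using only the figures quoted above; the helper name `relative_change` is illustrative.

```python
def relative_change(new: float, ref: float) -> float:
    """Percentage change of `new` relative to `ref` (negative means lower perplexity)."""
    return (new - ref) / ref * 100.0

PPL_UD_Q3_K_XL = 2.8784  # unsloth/UD-Q3_K_XL
PPL_RTN = 2.8798         # this quant, RTN (no imatrix)
PPL_IMATRIX = 2.8265     # this quant, imatrix

print(f"imatrix vs RTN:        {relative_change(PPL_IMATRIX, PPL_RTN):+.2f} %")
print(f"imatrix vs UD-Q3_K_XL: {relative_change(PPL_IMATRIX, PPL_UD_Q3_K_XL):+.2f} %")
```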

Decode speed (the +5 %)

Reference box: 2× Xeon E5-2690 v4 (14c each, AVX2, no AVX-512), 1× MI100 holding attention + KV, all 256 experts in system RAM:

| quant | decode tok/s ↑ |
| --- | --- |
| `unsloth/UD-Q3_K_XL` (IQ experts) | 3.35 |
| this quant (Q3_K experts) | 3.53 (+5 %) |

The +5 % is purely the cheaper expert dequant (K-quant shift/mask vs IQ codebook gather); it grows on older CPUs with weaker gather. Decode is CPU-dequant-bound, so it scales with memory bandwidth and core count — AVX-512 / more memory channels go faster. (And --spec-type ngram-cache adds the speculation uplift on top.)
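The decode figures combine as simple multipliers. This Python sketch uses the tok/s numbers from the table above and the ngram uplifts from the speculative decoding section; it assumes those uplifts apply multiplicatively on top of the 3.53 tok/s baseline, which is an approximation rather than a separate measurement.

```python
BASELINE_TOK_S = 3.35    # unsloth/UD-Q3_K_XL (IQ experts)
THIS_QUANT_TOK_S = 3.53  # this quant (Q3_K experts)

# ngram speculation uplift by output type, from the speculative decoding table.
NGRAM_UPLIFT = {
    "verbatim / highly repetitive": 0.80,
    "CSV / structured records": 0.52,
    "boilerplate / templated code": 0.37,
    "general prose": 0.00,
    "novel, non-repetitive code": -0.05,
}

def pct_gain(new: float, ref: float) -> float:
    """Percentage gain of `new` over `ref`."""
    return (new / ref - 1.0) * 100.0

def projected_tok_s(base: float, uplift: float) -> float:
    """Decode rate after applying a fractional uplift to a base rate."""
    return base * (1.0 + uplift)

print(f"Q3_K vs IQ experts: {pct_gain(THIS_QUANT_TOK_S, BASELINE_TOK_S):+.1f} %")
for workload, uplift in NGRAM_UPLIFT.items():
    print(f"{workload:30s} ~{projected_tok_s(THIS_QUANT_TOK_S, uplift):.2f} tok/s")
```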

Build provenance

zai-org/GLM-5.2 (BF16 safetensors) → convert_hf_to_gguf.py (glm-dsa) → Q8_0 GGUF → llama-quantize --allow-requantize --imatrix <wikitext.imatrix> --custom-q 'ffn_.*_exps=q3_K' <q8_0> <out> Q8_0.

ik_llama.cpp loader note: some builds require the DSA-indexer tensors (blk.N.indexer.*), but mainline's converter writes them as optional and only on a subset of layers. They are loaded-but-unused in inference, so if loading fails with check_tensor_dims: ...indexer.k_norm.weight not found, mark those create_tensor calls TENSOR_NOT_REQUIRED.
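Before patching a loader, it can help to confirm which layers actually lack the indexer tensors. This is a small Python sketch that works on a plain list of tensor names; the function name and sample names are hypothetical. It assumes the names follow the `blk.N.indexer.*` pattern mentioned above, and that the names are obtained separately, for example with the `gguf` Python package's `GGUFReader`.

```python
import re
from typing import Iterable, List, Set

BLOCK_RE = re.compile(r"^blk\.(\d+)\.")
INDEXER_RE = re.compile(r"^blk\.(\d+)\.indexer\.")

def layers_without_indexer(tensor_names: Iterable[str]) -> List[int]:
    """Return the sorted layer indices that have no blk.N.indexer.* tensor."""
    all_layers: Set[int] = set()
    indexer_layers: Set[int] = set()
    for name in tensor_names:
        block = BLOCK_RE.match(name)
        if block:
            all_layers.add(int(block.group(1)))
        indexer = INDEXER_RE.match(name)
        if indexer:
            indexer_layers.add(int(indexer.group(1)))
    return sorted(all_layers - indexer_layers)

# Illustrative names only; a real list would come from the GGUF file itself.
names = [
    "blk.0.attn_output.weight",
    "blk.0.indexer.k_norm.weight",
    "blk.1.attn_output.weight",
    "token_embd.weight",
]
print(layers_without_indexer(names))
```

Layers reported here are the ones for which a strict loader would fail unless the corresponding tensors are treated as optional.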

GGUF

Model size: 753B params
Architecture: glm-dsa
Base model: zai-org/GLM-5.2