openPangu-2.0-Flash GGUF

GGUF conversion of openPangu-2.0-Flash (92B MoE, ~6B active parameters, 512K context), converted directly from the original bf16 safetensors.

These files require a llama.cpp fork with openPangu support: https://github.com/mrexodia/llama.cpp-openPangu-2.0-Flash Upstream llama.cpp cannot load this architecture yet.

Supported by the fork: MLA attention, DSA sparse attention (lightning indexer top-2048) on the global layers, per-layer sliding-window attention, manifold hyper-connections (mHC), MoME convolutions, learned attention sinks, tool calling + <think> reasoning parsing, and optional multi-token-prediction (MTP) self-speculative decoding.

Files

File Size Notes
openPangu-2.0-Flash-base-Q3_K_M.gguf 42 GB fits 64 GB Apple Silicon
openPangu-2.0-Flash-base-Q4_K_M.gguf 52 GB recommended for speed
openPangu-2.0-Flash-base-Q8_0.gguf 91 GB recommended for quality (fits DGX Spark)
openPangu-2.0-Flash-base-BF16.gguf 183 GB requant source
openPangu-2.0-Flash-mtp-Q8_0.gguf 9.2 GB optional MTP draft head
openPangu-2.0-Flash-mtp-BF16.gguf 19 GB requant source

The base files omit the 3 MTP (NextN) layers; the mtp files contain only them, for use as a speculative draft model.

Measured perplexity (clean English prose, -c 2048): Q4_K_M 3.46, Q3_K_M 3.70. Needle-in-a-haystack retrieval validated to 100K tokens; tool calling verified against the OpenAI-compatible server API.

Running on a DGX Spark (GB10)

git clone https://github.com/mrexodia/llama.cpp-openPangu-2.0-Flash
cd llama.cpp-openPangu-2.0-Flash
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --target llama-server

build/bin/llama-server -m openPangu-2.0-Flash-base-Q8_0.gguf -c 65536 --jinja

Context can be raised up to -c 524288 (the compressed MLA KV cache stays small: roughly 12 GB at the full 512K).

Measured performance (DGX Spark, Q4_K_M)

short @10K @24K @100K
Prompt processing 770 t/s¹ 666 t/s 548 t/s 275 t/s
Generation ~25 t/s 23.1 t/s 22.1 t/s 18.5 t/s

¹ llama-bench pp512; the depth columns are measured through llama-server chat requests (needle-in-a-haystack prompts), so they include sampling and per-request overhead. Raw decode measures 38 t/s (llama-bench tg128). Q8_0 runs at roughly two thirds of the Q4 speed.

Performance holds up at depth because the fork ships fused CUDA kernels for the model's hyper-connection layers and its DSA sparse attention: fused indexer scoring at both prefill and decode, radix-select top-k, and gather-based decode attention over only the top-2048 selected tokens.

Optional MTP speculative decoding (mainly benefits discrete GPUs; on bandwidth-bound unified-memory devices it is usually a small net loss):

build/bin/llama-server -m openPangu-2.0-Flash-base-Q8_0.gguf \
    -md openPangu-2.0-Flash-mtp-Q8_0.gguf --mtp -c 65536 --jinja

Apple Silicon (64 GB)

Use Q3_K_M and raise the Metal wired-memory limit before loading:

sudo sysctl iogpu.wired_limit_mb=57344
build/bin/llama-server -m openPangu-2.0-Flash-base-Q3_K_M.gguf -c 32768 --jinja

License

The model weights are subject to the openPangu license; this repository redistributes them in converted form under the same terms.

Downloads last month
45
GGUF
Model size
92B params
Architecture
openpangu-v2
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mrexodia/openPangu-2.0-Flash-GGUF

Quantized
(5)
this model