DeepSeek-V4-Flash REAP25 LCB50 DS4 GGUF

Experimental DS4 compact GGUF made by applying 25% REAP expert pruning to a DeepSeek-V4-Flash DS4 GGUF.

Model file:

DeepSeek-V4-Flash-REAP25-LCB50-DS4-compact-IQ2XXS.gguf

Bundled runtime:

ds4_reap_runtime/

Compatibility

This model needs the bundled REAP-aware DS4 runtime, or another DS4 build that supports ds4-compact-v1.

It is not expected to run with stock DS4, llama.cpp, Ollama, LM Studio, or other generic GGUF loaders. The routed expert tensors are physically compacted, so the runtime must read the REAP metadata and route into compact expert ids.

Expected DS4 runtime line:

REAP runtime metadata enabled: hash_preserved=3 router_masked=40 moe_disabled=0 layout=ds4-compact-v1

How It Was Made

Source GGUF:

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf

Calibration:

  • Dataset: LiveCodeBench
  • Selected samples: 50
  • Sampling: balanced random by difficulty
  • Seed: 42
  • Distribution: easy 17, medium 17, hard 16
  • Observed prompt tokens: 26386
  • Observed routed expert selections: 6807588

Pruning:

  • Layers 0-2: preserved, hash-routed
  • Layers 3-42: REAP-pruned
  • Compression ratio: 0.25
  • Experts per pruned layer: 256 -> 192
  • Top-k remains 6
  • Expert tensor bytes are copied directly, preserving source quantization

Size:

source file: 80.76 GiB / 86.72 GB
REAP25 file: 63.87 GiB / 68.58 GB

Local Metal mapping at --ctx 512:

source mapped: 82697.67 MiB
REAP25 mapped: 65397.66 MiB
saved: ~17300 MiB, about 16.9 GiB

Run With Bundled Runtime

The Metal runtime loads shader source files from metal/*.metal, so run from inside the bundled runtime directory:

cd ds4_reap_runtime

./ds4 \
  -m ../DeepSeek-V4-Flash-REAP25-LCB50-DS4-compact-IQ2XXS.gguf \
  --ctx 512 --nothink --temp 0 -n 64 \
  -p 'stack and queue python code'

For OpenAI-compatible local serving:

cd ds4_reap_runtime

./ds4-server \
  -m ../DeepSeek-V4-Flash-REAP25-LCB50-DS4-compact-IQ2XXS.gguf \
  --ctx 32768 --tokens 1024 \
  --host 127.0.0.1 --port 8000

Notes

This is a 50-sample coding-domain calibration artifact, not a full benchmarked release. It is mainly for testing DS4-native REAP compaction and runtime support.

Downloads last month
125
GGUF
Model size
220B params
Architecture
deepseek4
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support