EPIC-Quant for Gemma 4 E4B
CPU-first reference implementation of three layer-aware compression
pillars for Google's gemma-4-E4B (8 B parameters, 4.5 B effective
with PLE, 42 layers, hybrid sliding-window + global attention with
p-RoPE, dense, no MTP). Measured against the actual safetensors on
disk, no synthetic weights.
Status: research artifact, not a production inference engine. This is a measurement harness with real numbers. It is suitable for reproducing the measurements, for discussion, and as a starting point for a real deployment (see "What's not here" below).
What this is
Three pillars, each implemented and benchmarked end-to-end:
- Layer-type-aware weight quantization: sliding-attn q/k/v/o quantize at one bit budget, global-attn q/k/v/o at another, MLP and PLE companions at a third. Packed bytes are reported as the real in-RAM cost.
- PLE (Per-Layer Embedding) sparse hash: the 5.27 GB [262144, 10752] PLE table is sparse-cached with a hot top-K in RAM and per-row mmap reads for cold tokens. Measured 86% hot hit rate on a realistic 85/15 workload.
- p-RoPE-aware KV cache eviction budget: sliding layers keep 4-bit rotated / drop 1-bit unrotated; global layers keep 4-bit rotated / drop 2-bit unrotated (because p-RoPE rotates only 25% of the head dim on global). Bit-budget model only; the packing kernel is a follow-up.
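The hot/cold split in the second pillar can be pictured with a minimal sketch. It is not the repo's PLECache: the class name, the byte-row layout, and the toy 64-row table are all assumptions made for illustration, and the sketch uses only the standard library so it runs anywhere.

```python
import mmap
import os
import tempfile


class SparseRowCache:
    """Hot rows are copied into RAM; every other row is read from an mmap on demand."""

    def __init__(self, path, row_bytes, hot_ids):
        self.row_bytes = row_bytes
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Materialise the hot set once so later lookups never touch the file.
        self.hot = {i: self._read(i) for i in hot_ids}
        self.hits = 0
        self.misses = 0

    def _read(self, row_id):
        start = row_id * self.row_bytes
        return self._mm[start:start + self.row_bytes]

    def get(self, row_id):
        row = self.hot.get(row_id)
        if row is not None:
            self.hits += 1
            return row
        self.misses += 1  # cold token: one row-sized read through the mmap
        return self._read(row_id)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def close(self):
        self._mm.close()
        self._file.close()


def demo():
    """Build a toy 64-row table on disk, then look up three hot ids and one cold id."""
    rows, row_bytes = 64, 8
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(rows):
            f.write(bytes([i]) * row_bytes)
        path = f.name
    cache = SparseRowCache(path, row_bytes, hot_ids=range(4))
    try:
        fetched = [cache.get(i) for i in (0, 1, 3, 40)]
        return fetched, cache.hit_rate()
    finally:
        cache.close()
        os.remove(path)
```

On sizing: if each PLE row is 10752 BF16 values (21504 bytes) and the hot set holds 5000 rows, as the --ple-hot 5000 flag suggests, the hot table is 5000 × 21504 = 107.5 MB, which lines up with the 108 MB reported below.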
What this is not
- Not a from_pretrained-able quantized model on HF Hub.
- Not a transformers/vllm/llama.cpp plugin.
- Not validated against MMLU Pro / MRCR v2 8-needle 128K / Codeforces ELO. The reference measures quant L2 reconstruction error and forward timing, not task quality.
- Not optimized. Forward path uses F.scaled_dot_product_attention with a Python-built mask on CPU. Memory-bandwidth-bound workloads on a real GPU with a fused unpack-and-matmul kernel (Triton / CUTLASS / custom C++) would be expected to beat FP16 throughput at 1.58 and 3 bit; that is not measured here.
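The "Python-built mask" in the last bullet can be pictured with a small sketch. It assumes a causal sliding window in which query position i may attend to key positions j with i - window < j <= i, and it treats a global layer as the same mask with no window. The function name and the window convention are illustrative and are not taken from forward.py. A boolean mask of this shape, where True means "may attend", is the form F.scaled_dot_product_attention accepts through its attn_mask argument once converted to a tensor.

```python
from typing import List, Optional


def build_attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j.

    window=None gives a plain causal mask (assumed here for global layers).
    window=w additionally restricts each query to its w most recent keys,
    itself included (assumed convention for sliding layers).
    """
    if window is not None and window < 1:
        raise ValueError("window must be at least 1")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or i - j < window
            row.append(causal and in_window)
        mask.append(row)
    return mask
```

Building the mask with nested Python loops is quadratic in sequence length, which is acceptable at the seq_len=16 used for the measurements but is one reason the forward path is described as unoptimized.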
The headline finding
The brief's "1.58-bit ternary on sliding attention" pillar is qualitatively wrong at the proposed bit budget. Measured L2 reconstruction error on the actual E4B weights is >1.0, which means the dequantized weights are mostly noise. The mechanism (compress the low-context layer type) is correct; the bit width is not.
3-bit on sliding attn is the realistic floor. L2 recon drops
from 1.11 to 0.29 (roughly 4× lower) for +114 MB of attn weight
(+6% on the total footprint). 4-bit uniform is the safe conservative
choice. Full sweep in COMPARISON.md, full reasoning in WRITEUP.md.
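To make the metric concrete, here is a minimal sketch of a reconstruction-error sweep. It makes several assumptions that the source does not confirm: that "L2 recon" is the relative error ||W - Q(W)|| / ||W||, that quantization is symmetric and uniform with one absmax scale per row, and that a 2-bit budget means ternary codes. It also runs on synthetic Gaussian values purely for illustration, so its numbers are not comparable to the measured ones above.

```python
import math
import random


def quantize_dequantize(row, bits):
    """Symmetric uniform quantization with a single absmax scale for the row."""
    qmax = 2 ** (bits - 1) - 1  # bits=2 gives qmax=1, i.e. ternary codes -1, 0, +1
    absmax = max(abs(x) for x in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    out = []
    for x in row:
        q = max(-qmax, min(qmax, round(x / scale)))
        out.append(q * scale)
    return out


def relative_l2_error(row, recon):
    """||row - recon|| / ||row||; 0 is exact, values near 1 mean little signal survives."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, recon)))
    den = math.sqrt(sum(a * a for a in row))
    return num / den


def sweep(bits_list=(2, 3, 4), n=4096, seed=0):
    """Quantize one synthetic row at each bit width and report its relative error."""
    rng = random.Random(seed)
    row = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return {b: relative_l2_error(row, quantize_dequantize(row, b)) for b in bits_list}
```

Calling sweep() returns one error per bit width. With a single absmax scale, most values round to zero at the ternary setting, which is one plausible way a reconstruction error can approach or exceed 1.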
Repo layout
epic_quant/
    __init__.py
    layers.py           # layer_dims, layer_param_keys
    loader.py           # MmapSafetensors: lazy v1-safetensors read
    packed.py           # 2-bit / 3-bit / 4-bit / 16-bit packed weight formats
    engine.py           # policies + PLECache + KVEvictor + EPICQuantEngine
    forward.py          # one-block forward (packed quant + real SDPA) on CPU
    bench.py            # single-policy bench and --sweep 4-policy comparison
    build_report.py     # turns sweep.json into a markdown table
scripts/
    inspect_shapes.py   # dumps the safetensors header shapes
    probe_header.py     # confirms the file is v1 safetensors
COMPARISON.md           # 1.58 / 3 / 4 / 16-bit sweep, side-by-side
WRITEUP.md              # full architecture writeup, what was built / dropped
LICENSE                 # Apache 2.0
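As an illustration of what a 2-bit packed format involves, the sketch below packs ternary codes four to a byte. It is an assumption about the general technique, not a copy of packed.py: the bit order, the +1 offset, and the function names are all hypothetical. It also shows why a "1.58-bit" ternary weight (log2(3) ≈ 1.585 bits of information) still costs 2 bits once packed this way.

```python
def pack_2bit(codes):
    """Pack ternary codes (-1, 0, +1) four to a byte, lowest bit pair first."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        if c not in (-1, 0, 1):
            raise ValueError(f"not a ternary code: {c!r}")
        # Shift the code to the unsigned range 0..2 and place it in its 2-bit slot.
        packed[i // 4] |= (c + 1) << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n ternary codes from a buffer produced by pack_2bit."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]
```

The element count n has to be stored alongside the buffer, because unused slots in the final byte are zero bits and would otherwise decode as -1.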
How to run
# Python 3.10+ with torch, transformers, safetensors, numpy installed.
# CPU is fine; this whole bench runs in 2-5 minutes on a single core.
# 1. Make sure you have a Gemma 4 E4B safetensors somewhere. Either:
# - download via LM Studio (easiest on this box), or
# - python -c "from huggingface_hub import snapshot_download;
# snapshot_download('google/gemma-4-E4B',
# allow_patterns=['*.json','*.safetensors','tokenizer*'])"
# 2. Run the sweep:
$env:PYTHONPATH = "C:\Users\Zwmar\projects\e4b"
python -m epic_quant.bench --sweep --out sweep.json
# 3. Build the human report:
python -m epic_quant.build_report sweep.json COMPARISON.md
Single-policy run (the brief's exact config):
python -m epic_quant.bench --sliding-bits 2 --global-bits 4 --mlp-bits 4 `
--ple-hot 5000 --out bench.json
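The three bit flags above correspond to the layer-type-aware policy from the first pillar. A minimal sketch of such a lookup follows. The class, its field names, and the "sliding" / "global" / "attn" / "mlp" / "ple" labels are hypothetical and are not the policy objects defined in engine.py. The only behaviour taken from this README is that sliding attention, global attention, and MLP plus PLE companions each get their own budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitPolicy:
    sliding_bits: int
    global_bits: int
    mlp_bits: int

    def bits_for(self, layer_type: str, role: str) -> int:
        """Return the bit width for one weight tensor.

        layer_type: "sliding" or "global" (attention flavour of the layer).
        role: "attn" for q/k/v/o, "mlp" for MLP weights, "ple" for PLE companions.
        """
        if role in ("mlp", "ple"):
            return self.mlp_bits  # MLP and PLE companions share the third budget
        if role == "attn" and layer_type == "sliding":
            return self.sliding_bits
        if role == "attn" and layer_type == "global":
            return self.global_bits
        raise ValueError(f"unknown combination: {layer_type!r}, {role!r}")


def packed_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store num_params values at the given width, rounded up."""
    return (num_params * bits + 7) // 8
```

For example, BitPolicy(sliding_bits=2, global_bits=4, mlp_bits=4) mirrors the single-policy command above, and packed_bytes gives the size of one tensor before any scale or metadata overhead is counted.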
Measured numbers (real, this box)
All numbers from python -m epic_quant.bench --sweep on the actual
google/gemma-4-E4B safetensors (15.99 GiB on disk), CPU, BF16
end-to-end. 200 tokens, seq_len=16, packed 2/3/4-bit weights.
| Policy | Attn | MLP | PLE companions | PLE hot | Total | Sliding attn L2 |
|---|---|---|---|---|---|---|
| 1.58-bit (brief) | 207 MB | 1.65 GB | 28 MB | 108 MB | 1.99 GB | 1.11 |
| 3-bit | 322 MB | 1.65 GB | 28 MB | 108 MB | 2.11 GB | 0.29 |
| 4-bit uniform | 322 MB | 1.65 GB | 28 MB | 108 MB | 2.11 GB | 0.17 |
| 16-bit (no quant) | 1.28 GB | 6.61 GB | 110 MB | 108 MB | 8.11 GB | 0.00 |
PLE full on disk is 5.27 GB. PLE sparse hash is the second big win (5.27 GB → 108 MB hot table) and is policy-independent. KV cache compression (sliding 4×, global 5.8× at the configured bit budget) is the same across all four policies.
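The totals in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the table values, assumes decimal units (1 GB = 1000 MB), and allows for the rounding already present in the reported figures; the names are illustrative.

```python
# (attn, mlp, ple_companions, ple_hot, reported_total), all in GB, copied from the table.
TABLE_GB = {
    "1.58-bit (brief)": (0.207, 1.65, 0.028, 0.108, 1.99),
    "3-bit": (0.322, 1.65, 0.028, 0.108, 2.11),
    "4-bit uniform": (0.322, 1.65, 0.028, 0.108, 2.11),
    "16-bit (no quant)": (1.28, 6.61, 0.110, 0.108, 8.11),
}


def component_sum(row):
    """Add the four component columns of one table row."""
    attn, mlp, companions, hot, _reported = row
    return attn + mlp + companions + hot


def totals_consistent(tolerance=0.006):
    """True per policy when the components add up to the reported total within rounding."""
    return {
        name: abs(component_sum(row) - row[4]) <= tolerance
        for name, row in TABLE_GB.items()
    }


def ple_reduction(full_gb=5.27, hot_gb=0.108):
    """Ratio of the full PLE table to the hot table kept in RAM."""
    return full_gb / hot_gb
```

Every row adds up to its reported total, the PLE hot table is roughly 49× smaller than the full table, and the 1.58-bit and 3-bit totals are about 4.1× and 3.8× smaller than the 16-bit total.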
What's not here (and why)
- No GPU kernel. CPU-only. Fused unpack-and-matmul on a real GPU is where the throughput win lives.
- No transformers integration. This is a standalone measurement harness, not a model class.
- No quality eval. No WikiText-103 PPL, no MMLU Pro, no MRCR v2 8-needle 128K. Only quant L2 recon and CPU forward time. To make this a real product you would run those evals at 1.58 / 3 / 4 bit and confirm L2 recon is a useful proxy for the published 69.4% MMLU Pro / 25.4% MRCR.
- No KV packing kernel. KVPolicy is a bit-budget model with theoretical compression ratios. The bytes-on-disk packing is a follow-up.
- No RoPE in the reference forward. We skip p-RoPE; a real deployment would call transformers' Gemma4RotaryEmbedding.
- Dropped from the original brief with reasons documented in WRITEUP.md §1: Epi-Stochastic Fetching (E4B is dense, not MoE), Speculative MTP Prefetching (E4B has no MTP head in config or safetensors).
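Since KVPolicy is described as a bit-budget model, the kind of arithmetic involved can be sketched as a weighted average of bit widths over the rotated and unrotated parts of a head. This is one reading of the description, not the repo's code: the function names are hypothetical, the baseline is assumed to be a 16-bit cache, and no scale or metadata overhead is modelled.

```python
def average_bits(rotated_fraction, rotated_bits, unrotated_bits):
    """Mean bits per cached element when a fraction of the head dim is rotated."""
    if not 0.0 <= rotated_fraction <= 1.0:
        raise ValueError("rotated_fraction must be between 0 and 1")
    return rotated_fraction * rotated_bits + (1.0 - rotated_fraction) * unrotated_bits


def compression_ratio(avg_bits, baseline_bits=16):
    """How many times smaller the cache is than an uncompressed baseline."""
    return baseline_bits / avg_bits


# Global layers as described above: 25% of the head dim rotated at 4 bits,
# the remaining 75% at 2 bits.
global_avg = average_bits(0.25, 4, 2)
global_ratio = compression_ratio(global_avg)
```

Under this idealized reading the global split averages 2.5 bits per element, or 6.4× against 16-bit. The 5.8× reported above is lower, so the repo's model evidently counts something this sketch leaves out; the source does not say what.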
License
Apache 2.0. See LICENSE. The Gemma 4 E4B weights are
not bundled; they are downloaded at runtime from
huggingface.co/google/gemma-4-E4B and remain subject to Google's
Gemma Terms of Use.