hipfire DeepSeek V4 Flash (mq2lloyd)

A mixed-precision packaging of deepseek-ai/DeepSeek-V4-Flash (284 B total / 13 B active per upstream) for the hipfire Rust-native inference engine on AMD RDNA GPUs.

Upstream is shipped in FP8 (e4m3) with FP4 routed-expert weights. This packaging rewrites the dominant weight class — the 256 routed MoE experts per layer — as 2-bit MagnumQuant-Lloyd (MQ2-Lloyd), and keeps everything else as Q8F16 or F16. The container suffix .mq2lloyd names that dominant class, but the file is not a uniform 2-bit dump.

The file format is hipfire's HFQ container and is not GGUF / safetensors / AWQ compatible — it only loads in hipfire.

What's inside (verified by enumerating the file's tensor table)

The 86.2 GB file contains 34 223 tensors. By hipfire QuantType enum:

  33024  qt=19 (MQ2G256Lloyd)        – 256 routed experts × 3 (w1/w2/w3) × 43 layers
    389  qt=3  (Q8F16)               – shared experts, main attn, embed, lm_head, router gates
    807  qt=1  (F16)                 – norms, compressor, indexer, hc_*, attn_sink
      3  qt=22 (TidI32)              – hash-router fast-path (3 layers only)

Tensor class	Storage	Weights (this file)	Storage bytes
Routed MoE experts	MQ2G256Lloyd, 2.25 bpw (G=256 → 64 B 2-bit indices + 8 B fp16 Lloyd codebook = 72 B/group)	277.025 B	77.913 GB
Shared expert + main attention (wq_a/b, wkv, wo_a/b) + embed + lm_head + MoE router gates	Q8F16, 8.5 bpw (GGML Q8_0 block: 2 B F16 scale + 32 B Q8 data per 32 weights = 34 B / 32 w)	6.785 B	7.209 GB
Compressor / indexer matrices, RMSNorm scales, attn_sink, HC gating (`hc_attn_`, `hc_ffn_`, `hc_head_*`)	F16, 16 bpw	0.522 B	1.043 GB
`tid2eid` hash-router table (layers 0–2 only — `num_hash_layers = 3`)	TidI32, 32 bpw	2.3 M	9.3 MB
Total	mixed → 2.425 bpw avg	284.335 B	86.175 GB weights + 9.2 MB header/metadata = 86.184 GB on disk (estimate matches the actual file to within 0.01 %)

The runtime SWA K and V state (window = 128) is F32 and is allocated live at session start — it is not persisted in the file.

Files

File	Size	Purpose
`deepseek-v4-flash.mq2lloyd`	86,184,307,283 B (≈86.2 GB)	Main model — 43 layers, 256 routed + 1 shared expert per layer, attention, embed, lm_head
`deepseek-v4-flash-mtp.mq2lloyd`	1,998,047,355 B (≈2.0 GB)	Single MTP layer (`num_nextn_predict_layers = 1`), opened automatically when present alongside the main file. Used for optional speculative decode.

Loading

This model is registered in hipfire's CLI registry under deepseek-v4-flash (aliases: deepseek4, deepseek-v4). The registry entry pulls both the main file and the MTP companion together.

# 1. Install hipfire (one-time):
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# 2. Pull the model (downloads both .mq2lloyd files into ~/.hipfire/models/):
hipfire pull deepseek-v4-flash

# 3. Run it:
hipfire run deepseek-v4-flash "Write a fibonacci function in C"
#    or interactive chat:
hipfire run deepseek-v4-flash
#    or expose an OpenAI-compatible HTTP server on :11435:
hipfire serve

hipfire run auto-pulls if the files aren't present, so step 2 is optional. Both HIPFIRE_DEEPSEEK4_MOE and HIPFIRE_DEEPSEEK4_UPLOAD_EXPERTS default to on at the engine level — no env vars are required.

From source (development path)

git clone https://github.com/Kaden-Schutt/hipfire
cd hipfire
cargo build --release -p hipfire-arch-deepseek4 --example deepseek4_chat

# Pull the files (uses `hf` CLI or hipfire's registry-driven pull above):
hf download nwoolmer/hipfire-deepseek-v4-flash \
    deepseek-v4-flash.mq2lloyd deepseek-v4-flash-mtp.mq2lloyd \
    --local-dir ~/.hipfire/models/

# Direct chat binary (DSML chat template, EOS stop, multi-turn KV):
HIPFIRE_DEEPSEEK4_MODEL=~/.hipfire/models/deepseek-v4-flash.mq2lloyd \
    ./target/release/examples/deepseek4_chat

For programmatic access, crates/hipfire-runtime/examples/daemon.rs exposes the engine as a JSON-lines IPC service over stdin/stdout (spawned by the Bun CLI front-end); it dispatches DeepSeek V4 Flash via arch_id = 9.

Architecture

The packaging follows the V4F architecture as described in the DeepSeek V4 paper. Every value below is read from the config JSON embedded in the HFQ file's metadata blob:

Field	Value	Notes
`architectures`	`["DeepseekV4ForCausalLM"]`	upstream class
`num_hidden_layers`	43
`hidden_size`	4096
`vocab_size`	129 280
`num_attention_heads` / `num_key_value_heads` / `head_dim`	64 / 1 / 512	64 query heads; KV is a single latent stream of dim 512 (MLA, joint K+V via `wkv [512, 4096]`)
`q_lora_rank` / `o_lora_rank` / `o_groups`	1024 / 1024 / 8	Q low-rank factorisation `wq_a [1024,4096] → wq_b [32768,1024]`; grouped O projection `wo_a [8192,4096] → wo_b [4096,8192]` with intermediate = `o_groups · o_lora_rank = 8192`
`qk_rope_head_dim`	64	tail-split RoPE: only the last 64 channels per head get rotated
`n_routed_experts` / `n_shared_experts` / `num_experts_per_tok`	256 / 1 / 6	top-6 routing per token
`moe_intermediate_size`	2048	per-expert width
`hc_mult` / `hc_sinkhorn_iters` / `hc_eps`	4 / 20 / 1e-6	4-stream Hyper-Connections, 20-iter Sinkhorn normalisation
`index_n_heads` / `index_head_dim` / `index_topk`	64 / 128 / 512	indexer-gated attention: top-512 over compressed-K
`compress_ratios`	`[0, 0, 4, 128, 4, 128, …, 4, 0]` (len 43)	layers 0–1 are dense; compressed-KV attention from layer 2 onward
`num_hash_layers`	3	first three layers carry the `tid2eid` router fast-path
`num_nextn_predict_layers`	1	one MTP head (in the companion file)
`sliding_window`	128	SWA window for the main attention path
`rope_theta` / `compress_rope_theta`	10 000 / 160 000
`rope_scaling`	YaRN, factor 16, original_max 65 536 → max_position 1 048 576
`expert_dtype` (upstream) / `quantization_config`	`fp4` / `fp8 e4m3 [128,128]`	upstream's quant; superseded here by MQ2-Lloyd + Q8F16 + F16

Tensor-presence cross-check against the file (also verified above):

Layers 0 and 1: dense attention (wq_a/b, wkv, wo_a/b only — no compressor / no indexer).
Layers 2–42: compressed-KV attention (every layer has attn.compressor.{wkv,wgate,ape,norm}).
Even layers 2, 4, …, 42 (21 layers): also carry the indexer block (attn.indexer.{wq_b, weights_proj, compressor.*}).
Layers 0, 1, 2: also ship the ffn.gate.tid2eid hash-router fast-path table.
The MTP companion file contains a single layer (mtp.0.*) with its own attention block, 256 routed experts, a shared expert, a full mHC stack (hc_attn_*, hc_ffn_*, hc_head_*), and input-side projection/norm tensors (e_proj, h_proj, enorm, hnorm) used by the MTP head's input mixing.

Performance

Measured 2026-05-28 on AMD Radeon 8060S (gfx1151, Strix Halo APU, 128 GB UMA), hipfire at tag v0.2.0 (commit 3d456e5c), ROCm 7.2.1, SWA attention path, temp=0.7 top_k=40, prompt_normalize on.

Mode	Throughput	How measured
Plain decode (TG)	~13.9 tok/s	13 warm-turn measurements across two chat processes (turn ≥ 2 of each, both 256-tok and 16-tok generations): median 13.86, range 13.61 – 14.00, σ ≈ 0.7 %.
Batched prefill (PP)	~55 tok/s	1 235-token fresh-KV prefill of a single-chunk system prompt (no internal blank lines). `HIPFIRE_DEEPSEEK4_PP_BATCH` default 1024. Sub-100-token chat-turn prompts are overhead-dominated (~40 tok/s) and are not the right point to cite.
Spec decode (MTP, K=3)	14–19 tok/s	Median 16.4 tok/s, +19 % over plain. Draft accept rate ranged 41–62 % across the three turns of the same chat (highest on direct code generation, lowest on conversational follow-ups). Enable with `HIPFIRE_DEEPSEEK4_SPEC_DECODE=1 HIPFIRE_DEEPSEEK4_SPEC_K=3`.

Cold-process load — weight upload from both HFQ files — is ~44 s on the 8060S (measured by an inline timestamping wrapper from process start until the engine prints DeepSeek V4 ready.). Plain decode is DRAM-bandwidth-bound; faster memory (RDNA3 desktop GDDR6, future RDNA4) scales decode roughly proportionally.

Compatibility

GPU: AMD RDNA3 / RDNA3.5 with HIP + WMMA. Validated on gfx1151 (Radeon 8060S, Strix Halo) for this build of the V4F weights. The engine has gfx1100 (RX 7900-class) kernels in tree but they were not exercised against this specific file; RDNA1/2 (gfx1010 / gfx103x) and gfx12 are tracked targets of the broader hipfire project but not recommended for running this model.
OS: Linux with amdgpu kernel driver. Built and measured against ROCm 7.2.1 / HIP 7.2.
Memory: 86.2 GB for the main file + ~2 GB for the MTP companion + several GB working set during decode. Strix Halo class systems with 128 GB UMA are the comfortable target; discrete-GPU configs need ~96 GB+ of VRAM.
Tested context length: end-to-end inference was exercised at prompts up to ~1.2 k tokens and generations up to 256 tokens. Upstream's stated max position (1 048 576 via YaRN) was not exercised here.

License

The upstream model (deepseek-ai/DeepSeek-V4-Flash) is MIT-licensed; the weights in this packaging inherit those terms.

The hipfire engine that produced and consumes this format is dual-licensed MIT / Apache-2.0 at the user's option (see LICENSE-MIT, LICENSE-APACHE, and NOTICE in the engine repo).

Acknowledgements

DeepSeek AI — original DeepSeek V4 Flash weights and architecture.
Salvatore Sanfilippo (antirez) — DwarfStar, a focused C/Metal/CUDA/ROCm reference engine for DeepSeek V4 Flash. The hipfire impl cross-validates against it for MTP wiring, HC reduction, and KV layout.

Citation

If you use this packaging, please cite the upstream model, the hipfire engine (per its CITATION.cff), and this HF release:

@misc{deepseekai2026deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}

@software{hipfire,
  title   = {hipfire — Rust-native LLM inference for AMD RDNA / CDNA},
  author  = {Schutt, Kaden},
  year    = {2026},
  version = {0.2.0},
  url     = {https://github.com/Kaden-Schutt/hipfire}
}

@misc{hipfire-deepseek-v4-flash,
  title  = {hipfire DeepSeek V4 Flash (MQ2-Lloyd)},
  author = {Woolmer, Nick},
  year   = {2026},
  url    = {https://huggingface.co/nwoolmer/hipfire-deepseek-v4-flash}
}

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for nwoolmer/hipfire-deepseek-v4-flash

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

(65)

this model