hipfire DeepSeek V4 Flash (mq2lloyd)

A mixed-precision packaging of deepseek-ai/DeepSeek-V4-Flash (284 B total / 13 B active per upstream) for the hipfire Rust-native inference engine on AMD RDNA GPUs.

Upstream is shipped in FP8 (e4m3) with FP4 routed-expert weights. This packaging rewrites the dominant weight class β€” the 256 routed MoE experts per layer β€” as 2-bit MagnumQuant-Lloyd (MQ2-Lloyd), and keeps everything else as Q8F16 or F16. The container suffix .mq2lloyd names that dominant class, but the file is not a uniform 2-bit dump.

The file format is hipfire's HFQ container and is not GGUF / safetensors / AWQ compatible β€” it only loads in hipfire.

What's inside (verified by enumerating the file's tensor table)

The 86.2 GB file contains 34 223 tensors. By hipfire QuantType enum:

  33024  qt=19 (MQ2G256Lloyd)        – 256 routed experts Γ— 3 (w1/w2/w3) Γ— 43 layers
    389  qt=3  (Q8F16)               – shared experts, main attn, embed, lm_head, router gates
    807  qt=1  (F16)                 – norms, compressor, indexer, hc_*, attn_sink
      3  qt=22 (TidI32)              – hash-router fast-path (3 layers only)
Tensor class Storage Weights (this file) Storage bytes
Routed MoE experts MQ2G256Lloyd, 2.25 bpw (G=256 β†’ 64 B 2-bit indices + 8 B fp16 Lloyd codebook = 72 B/group) 277.025 B 77.913 GB
Shared expert + main attention (wq_a/b, wkv, wo_a/b) + embed + lm_head + MoE router gates Q8F16, 8.5 bpw (GGML Q8_0 block: 2 B F16 scale + 32 B Q8 data per 32 weights = 34 B / 32 w) 6.785 B 7.209 GB
Compressor / indexer matrices, RMSNorm scales, attn_sink, HC gating (hc_attn_*, hc_ffn_*, hc_head_*) F16, 16 bpw 0.522 B 1.043 GB
tid2eid hash-router table (layers 0–2 only β€” num_hash_layers = 3) TidI32, 32 bpw 2.3 M 9.3 MB
Total mixed β†’ 2.425 bpw avg 284.335 B 86.175 GB weights + 9.2 MB header/metadata = 86.184 GB on disk (estimate matches the actual file to within 0.01 %)

The runtime SWA K and V state (window = 128) is F32 and is allocated live at session start β€” it is not persisted in the file.

Files

File Size Purpose
deepseek-v4-flash.mq2lloyd 86,184,307,283 B (β‰ˆ86.2 GB) Main model β€” 43 layers, 256 routed + 1 shared expert per layer, attention, embed, lm_head
deepseek-v4-flash-mtp.mq2lloyd 1,998,047,355 B (β‰ˆ2.0 GB) Single MTP layer (num_nextn_predict_layers = 1), opened automatically when present alongside the main file. Used for optional speculative decode.

Loading

This model is registered in hipfire's CLI registry under deepseek-v4-flash (aliases: deepseek4, deepseek-v4). The registry entry pulls both the main file and the MTP companion together.

# 1. Install hipfire (one-time):
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# 2. Pull the model (downloads both .mq2lloyd files into ~/.hipfire/models/):
hipfire pull deepseek-v4-flash

# 3. Run it:
hipfire run deepseek-v4-flash "Write a fibonacci function in C"
#    or interactive chat:
hipfire run deepseek-v4-flash
#    or expose an OpenAI-compatible HTTP server on :11435:
hipfire serve

hipfire run auto-pulls if the files aren't present, so step 2 is optional. Both HIPFIRE_DEEPSEEK4_MOE and HIPFIRE_DEEPSEEK4_UPLOAD_EXPERTS default to on at the engine level β€” no env vars are required.

From source (development path)

git clone https://github.com/Kaden-Schutt/hipfire
cd hipfire
cargo build --release -p hipfire-arch-deepseek4 --example deepseek4_chat

# Pull the files (uses `hf` CLI or hipfire's registry-driven pull above):
hf download nwoolmer/hipfire-deepseek-v4-flash \
    deepseek-v4-flash.mq2lloyd deepseek-v4-flash-mtp.mq2lloyd \
    --local-dir ~/.hipfire/models/

# Direct chat binary (DSML chat template, EOS stop, multi-turn KV):
HIPFIRE_DEEPSEEK4_MODEL=~/.hipfire/models/deepseek-v4-flash.mq2lloyd \
    ./target/release/examples/deepseek4_chat

For programmatic access, crates/hipfire-runtime/examples/daemon.rs exposes the engine as a JSON-lines IPC service over stdin/stdout (spawned by the Bun CLI front-end); it dispatches DeepSeek V4 Flash via arch_id = 9.

Architecture

The packaging follows the V4F architecture as described in the DeepSeek V4 paper. Every value below is read from the config JSON embedded in the HFQ file's metadata blob:

Field Value Notes
architectures ["DeepseekV4ForCausalLM"] upstream class
num_hidden_layers 43
hidden_size 4096
vocab_size 129 280
num_attention_heads / num_key_value_heads / head_dim 64 / 1 / 512 64 query heads; KV is a single latent stream of dim 512 (MLA, joint K+V via wkv [512, 4096])
q_lora_rank / o_lora_rank / o_groups 1024 / 1024 / 8 Q low-rank factorisation wq_a [1024,4096] β†’ wq_b [32768,1024]; grouped O projection wo_a [8192,4096] β†’ wo_b [4096,8192] with intermediate = o_groups Β· o_lora_rank = 8192
qk_rope_head_dim 64 tail-split RoPE: only the last 64 channels per head get rotated
n_routed_experts / n_shared_experts / num_experts_per_tok 256 / 1 / 6 top-6 routing per token
moe_intermediate_size 2048 per-expert width
hc_mult / hc_sinkhorn_iters / hc_eps 4 / 20 / 1e-6 4-stream Hyper-Connections, 20-iter Sinkhorn normalisation
index_n_heads / index_head_dim / index_topk 64 / 128 / 512 indexer-gated attention: top-512 over compressed-K
compress_ratios [0, 0, 4, 128, 4, 128, …, 4, 0] (len 43) layers 0–1 are dense; compressed-KV attention from layer 2 onward
num_hash_layers 3 first three layers carry the tid2eid router fast-path
num_nextn_predict_layers 1 one MTP head (in the companion file)
sliding_window 128 SWA window for the main attention path
rope_theta / compress_rope_theta 10 000 / 160 000
rope_scaling YaRN, factor 16, original_max 65 536 β†’ max_position 1 048 576
expert_dtype (upstream) / quantization_config fp4 / fp8 e4m3 [128,128] upstream's quant; superseded here by MQ2-Lloyd + Q8F16 + F16

Tensor-presence cross-check against the file (also verified above):

  • Layers 0 and 1: dense attention (wq_a/b, wkv, wo_a/b only β€” no compressor / no indexer).
  • Layers 2–42: compressed-KV attention (every layer has attn.compressor.{wkv,wgate,ape,norm}).
  • Even layers 2, 4, …, 42 (21 layers): also carry the indexer block (attn.indexer.{wq_b, weights_proj, compressor.*}).
  • Layers 0, 1, 2: also ship the ffn.gate.tid2eid hash-router fast-path table.
  • The MTP companion file contains a single layer (mtp.0.*) with its own attention block, 256 routed experts, a shared expert, a full mHC stack (hc_attn_*, hc_ffn_*, hc_head_*), and input-side projection/norm tensors (e_proj, h_proj, enorm, hnorm) used by the MTP head's input mixing.

Performance

Measured 2026-05-28 on AMD Radeon 8060S (gfx1151, Strix Halo APU, 128 GB UMA), hipfire at tag v0.2.0 (commit 3d456e5c), ROCm 7.2.1, SWA attention path, temp=0.7 top_k=40, prompt_normalize on.

Mode Throughput How measured
Plain decode (TG) ~13.9 tok/s 13 warm-turn measurements across two chat processes (turn β‰₯ 2 of each, both 256-tok and 16-tok generations): median 13.86, range 13.61 – 14.00, Οƒ β‰ˆ 0.7 %.
Batched prefill (PP) ~55 tok/s 1 235-token fresh-KV prefill of a single-chunk system prompt (no internal blank lines). HIPFIRE_DEEPSEEK4_PP_BATCH default 1024. Sub-100-token chat-turn prompts are overhead-dominated (~40 tok/s) and are not the right point to cite.
Spec decode (MTP, K=3) 14–19 tok/s Median 16.4 tok/s, +19 % over plain. Draft accept rate ranged 41–62 % across the three turns of the same chat (highest on direct code generation, lowest on conversational follow-ups). Enable with HIPFIRE_DEEPSEEK4_SPEC_DECODE=1 HIPFIRE_DEEPSEEK4_SPEC_K=3.

Cold-process load β€” weight upload from both HFQ files β€” is ~44 s on the 8060S (measured by an inline timestamping wrapper from process start until the engine prints DeepSeek V4 ready.). Plain decode is DRAM-bandwidth-bound; faster memory (RDNA3 desktop GDDR6, future RDNA4) scales decode roughly proportionally.

Compatibility

  • GPU: AMD RDNA3 / RDNA3.5 with HIP + WMMA. Validated on gfx1151 (Radeon 8060S, Strix Halo) for this build of the V4F weights. The engine has gfx1100 (RX 7900-class) kernels in tree but they were not exercised against this specific file; RDNA1/2 (gfx1010 / gfx103x) and gfx12 are tracked targets of the broader hipfire project but not recommended for running this model.
  • OS: Linux with amdgpu kernel driver. Built and measured against ROCm 7.2.1 / HIP 7.2.
  • Memory: 86.2 GB for the main file + ~2 GB for the MTP companion + several GB working set during decode. Strix Halo class systems with 128 GB UMA are the comfortable target; discrete-GPU configs need ~96 GB+ of VRAM.
  • Tested context length: end-to-end inference was exercised at prompts up to ~1.2 k tokens and generations up to 256 tokens. Upstream's stated max position (1 048 576 via YaRN) was not exercised here.

License

The upstream model (deepseek-ai/DeepSeek-V4-Flash) is MIT-licensed; the weights in this packaging inherit those terms.

The hipfire engine that produced and consumes this format is dual-licensed MIT / Apache-2.0 at the user's option (see LICENSE-MIT, LICENSE-APACHE, and NOTICE in the engine repo).

Acknowledgements

  • DeepSeek AI β€” original DeepSeek V4 Flash weights and architecture.
  • Salvatore Sanfilippo (antirez) β€” DwarfStar, a focused C/Metal/CUDA/ROCm reference engine for DeepSeek V4 Flash. The hipfire impl cross-validates against it for MTP wiring, HC reduction, and KV layout.

Citation

If you use this packaging, please cite the upstream model, the hipfire engine (per its CITATION.cff), and this HF release:

@misc{deepseekai2026deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}

@software{hipfire,
  title   = {hipfire β€” Rust-native LLM inference for AMD RDNA / CDNA},
  author  = {Schutt, Kaden},
  year    = {2026},
  version = {0.2.0},
  url     = {https://github.com/Kaden-Schutt/hipfire}
}

@misc{hipfire-deepseek-v4-flash,
  title  = {hipfire DeepSeek V4 Flash (MQ2-Lloyd)},
  author = {Woolmer, Nick},
  year   = {2026},
  url    = {https://huggingface.co/nwoolmer/hipfire-deepseek-v4-flash}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for nwoolmer/hipfire-deepseek-v4-flash

Quantized
(65)
this model