PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
+
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████

██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████ ██████████████

FASTCONTEXT-1.0-4B
4-BIT ROCmFP4 · QWEN3 DENSE 4B · REPO-EXPLORATION SUBAGENT · CODE-WEIGHTED IMATRIX · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
ARCH
QWEN3 DENSE
CONTEXT
256 K
PARAMS
4B DENSE
DRAFT
NO MTP
BACKEND
VULKAN0
LICENSE
MIT

⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of microsoft/FastContext-1.0-4B-SFT — Microsoft's repository-exploration subagent for coding agents. Instead of one model both exploring the repo and solving the task, FastContext is invoked on demand by a main agent, fires parallel read-only tool calls (READ / GLOB / GREP), and returns compact file paths + line ranges as focused context. Architecturally it's a plain Qwen3 dense 4B (Qwen3ForCausalLM, 36 layers, hidden 2560, 256K context, MIT-licensed), here in the custom ROCmFP4 4-bit format, imatrix-quantized.

01 · FILES
File Body Size Pick if
…-STRIX-embF16-imatrix.gguffast2.7 GBthe one build — best speed/quality balance: f16 tied embeddings/head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the quality lever that's actually felt — genuine f16 embeddings (from BF16), which also serve as the output head since the model ties them — on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix (see §04). The Qwen (ChatML) chat template is baked into the GGUF — just pass --jinja.

NOTE // TIED EMBEDDINGS. FastContext has tie_word_embeddings=True, so there's no separate output head — the token-embedding tensor doubles as the lm-head. Setting --token-embedding-type f16 therefore gives an f16 embedding and f16 output head in one (no headQ6 variant needed — f16 already beats Q6 there).
02 · QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in — just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m FastContext-1.0-4B-SFT-ROCmFP4-STRIX-embF16-imatrix.gguf \
  --alias fastcontext-4b \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk f16 \
  -ctv f16 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch · CPU threads
-ctk f16 · -ctv f16f16 KV cache — how we run it (cheap on a 4B); drop to q8_0/q4_0 to use less memory at deep context
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.7 --top-p 0.8 --top-k 20Qwen3 recommended sampling (instruct/non-thinking)
--jinja --parallel 1 --metrics --no-mmapapply baked ChatML template · single slot · metrics · weights in RAM
NOTE // No --spec-* / --spec-type draft-mtp flags — this arch has no MTP head (see §04). It's already fast on its own.
03 · USING IT AS A SUBAGENT

FastContext isn't a general chat model — it's a repository-exploration subagent meant to be called by your main coding agent, not driven directly. The intended loop: the main agent delegates "find the relevant context for X" → FastContext issues parallel read-only tool calls (READ, GLOB, GREP) → returns compact file paths + line ranges, which the main agent folds into its own context to do the actual work. The point is to keep repo-exploration tokens out of the main agent's window.

  • Chat template: Qwen (ChatML) is baked into the GGUF — just pass --jinja.
  • Tool calling: it emits structured READ/GLOB/GREP calls — wire those tools into your harness and use a Qwen/Hermes-style tool-call parser so they're parsed rather than printed. See the upstream model card for the exact subagent protocol + tool schema (it expects a specific invocation format).
  • Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen3 instruct defaults) — already set in §02.
NOTE // It's small (4B) and fast (~68 t/s, §04) by design — a cheap, disposable explorer you can fan out in parallel next to a larger main model on the same box. The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps repeated exploration over the same repo cheap.
04 · PERFORMANCE & QUALITY
DECODE · short context~68 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODEnone (no MTP head)
CONTEXT256K native (dense attention)
QUANTIZATIONfast single-scale body + f16 tied emb/head + code-weighted imatrix

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. It keeps the one quality lever that's actually felt — genuine f16 embeddings, which on this model double as the output head (tie_word_embeddings=True), so a single f16 tensor sharpens both the input and output side at near-zero decode cost (it's a lookup, not a matmul) — on top of the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. A leaner Q5-embedding build would shave a couple tok/s but degrades that lever; we keep full f16.

We didn't re-run the entire rocmfp4 lever sweep on this 4B. We ran it exhaustively on the larger Qwen3.6-27B — KL divergence vs the BF16 reference plus llama-bench decode across an all-dual-scale body, selective higher-precision tensors, and full f16 embeddings. The finding there: an all-dual-scale body and selective higher-precision tensors both cost decode speed for a KL improvement that sat inside the measurement noise, so the fast single-scale body + f16 embeddings is the balance point. That conclusion carries to FastContext — same format, same kernels — so we ship the one build that lands on it rather than a slower variant that wins KL only inside the noise.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 GGUF of the base from microsoft/FastContext-1.0-4B-SFT — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, a Q6_K/Q8 of the base is the one to grab.

Fast on its own. ~68 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0, measured llama-bench tg128). It's a 4B dense Qwen3 with no MTP head, so there's no speculative decoding — it doesn't need it, and at 4B it's a cheap explorer you can run several of in parallel.

NOTE // imatrix. This build is quantized with an importance matrix (Kalomaze groups_merged + froggeric code/technical, via froggeric/imatrix), computed on this model's BF16. We did not run a separate imatrix-vs-no-imatrix ablation on this 4B; at 4+ bpw imatrix is a free polish, not a transformation. Scope note: any fidelity-vs-BF16 figures are a held-out measurement, not an absolute coding benchmark.
05 · BUILD (REPRODUCIBLE)
# 0) convert the safetensors -> BF16 GGUF (plain qwen3 dense; no MTP, tied embeddings)
python convert_hf_to_gguf.py FastContext-1.0-4B-SFT/ --outtype bf16 --outfile FastContext-1.0-4B-SFT-BF16.gguf

# 1) imatrix on the BF16 (general+code: Kalomaze groups_merged + froggeric code/technical)
llama-imatrix -m FastContext-1.0-4B-SFT-BF16.gguf -f general+code-calib.txt -o fastcontext-4b.imatrix -c 512 -ngl 999

# 2) THE ONE BUILD: fast single-scale STRIX body + f16 tied emb/head + imatrix (the ★ file) — the balance point (§04).
#    tie_word_embeddings=True -> --token-embedding-type f16 also gives an f16 output head; no --output-tensor-type.
llama-quantize --token-embedding-type f16 --imatrix fastcontext-4b.imatrix \
  FastContext-1.0-4B-SFT-BF16.gguf  FastContext-1.0-4B-SFT-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS
BASE MODELmicrosoft/FastContext-1.0-4B-SFT (MIT, Microsoft) · repository-exploration subagent · Qwen3 dense 4B (Qwen3ForCausalLM)
CALIBRATIONKalomaze groups_merged + froggeric code/technical via froggeric/imatrix
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month
-
GGUF
Model size
4B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF

Quantized
(17)
this model