Instructions to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF",
	filename="FastContext-1.0-4B-SFT-ROCmFP4-STRIX-embF16-imatrix.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Use Docker

docker model run hf.co/plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

LM Studio
Jan
Ollama
How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
```

Unsloth Studio

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF to start chatting

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16
```

Lemonade

How to use plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF:BF16

Run and chat with the model

lemonade run user.FastContext-1.0-4B-SFT-ROCmFP4-GGUF-BF16

List all available models

lemonade list

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
+
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████

██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
██████████████  ██████████████
FASTCONTEXT-1.0-4B
4-BIT ROCmFP4 · QWEN3 DENSE 4B · REPO-EXPLORATION SUBAGENT · CODE-WEIGHTED IMATRIX · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      ARCH
QWEN3 DENSE

      CONTEXT
256 K

    

      PARAMS
4B DENSE

      DRAFT
NO MTP

      BACKEND
VULKAN0

      LICENSE
MIT

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of microsoft/FastContext-1.0-4B-SFT — Microsoft's repository-exploration subagent for coding agents. Instead of one model both exploring the repo and solving the task, FastContext is invoked on demand by a main agent, fires parallel read-only tool calls (READ / GLOB / GREP), and returns compact file paths + line ranges as focused context. Architecturally it's a plain Qwen3 dense 4B (Qwen3ForCausalLM, 36 layers, hidden 2560, 256K context, MIT-licensed), here in the custom ROCmFP4 4-bit format, imatrix-quantized.

01 · FILES

File	Body	Size	Pick if
`…-STRIX-embF16-imatrix.gguf` ★	fast	2.7 GB	the one build — best speed/quality balance: f16 tied embeddings/head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the quality lever that's actually felt — genuine f16 embeddings (from BF16), which also serve as the output head since the model ties them — on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix (see §04). The Qwen (ChatML) chat template is baked into the GGUF — just pass --jinja.

NOTE // TIED EMBEDDINGS. FastContext has tie_word_embeddings=True, so there's no separate output head — the token-embedding tensor doubles as the lm-head. Setting --token-embedding-type f16 therefore gives an f16 embedding and f16 output head in one (no headQ6 variant needed — f16 already beats Q6 there).

02 · QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in — just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m FastContext-1.0-4B-SFT-ROCmFP4-STRIX-embF16-imatrix.gguf \
  --alias fastcontext-4b \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk f16 \
  -ctv f16 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it (cheap on a 4B); drop to `q8_0`/`q4_0` to use less memory at deep context
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.7 --top-p 0.8 --top-k 20`	Qwen3 recommended sampling (instruct/non-thinking)
`--jinja --parallel 1 --metrics --no-mmap`	apply baked ChatML template · single slot · metrics · weights in RAM

NOTE // No --spec-* / --spec-type draft-mtp flags — this arch has no MTP head (see §04). It's already fast on its own.

03 · USING IT AS A SUBAGENT

FastContext isn't a general chat model — it's a repository-exploration subagent meant to be called by your main coding agent, not driven directly. The intended loop: the main agent delegates "find the relevant context for X" → FastContext issues parallel read-only tool calls (READ, GLOB, GREP) → returns compact file paths + line ranges, which the main agent folds into its own context to do the actual work. The point is to keep repo-exploration tokens out of the main agent's window.

Chat template: Qwen (ChatML) is baked into the GGUF — just pass --jinja.
Tool calling: it emits structured READ/GLOB/GREP calls — wire those tools into your harness and use a Qwen/Hermes-style tool-call parser so they're parsed rather than printed. See the upstream model card for the exact subagent protocol + tool schema (it expects a specific invocation format).
Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen3 instruct defaults) — already set in §02.

NOTE // It's small (4B) and fast (~68 t/s, §04) by design — a cheap, disposable explorer you can fan out in parallel next to a larger main model on the same box. The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps repeated exploration over the same repo cheap.

04 · PERFORMANCE & QUALITY

DECODE · short context	~68 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODE	none (no MTP head)
CONTEXT	256K native (dense attention)
QUANTIZATION	fast single-scale body + f16 tied emb/head + code-weighted imatrix

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. It keeps the one quality lever that's actually felt — genuine f16 embeddings, which on this model double as the output head (tie_word_embeddings=True), so a single f16 tensor sharpens both the input and output side at near-zero decode cost (it's a lookup, not a matmul) — on top of the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. A leaner Q5-embedding build would shave a couple tok/s but degrades that lever; we keep full f16.

We didn't re-run the entire rocmfp4 lever sweep on this 4B. We ran it exhaustively on the larger Qwen3.6-27B — KL divergence vs the BF16 reference plus llama-bench decode across an all-dual-scale body, selective higher-precision tensors, and full f16 embeddings. The finding there: an all-dual-scale body and selective higher-precision tensors both cost decode speed for a KL improvement that sat inside the measurement noise, so the fast single-scale body + f16 embeddings is the balance point. That conclusion carries to FastContext — same format, same kernels — so we ship the one build that lands on it rather than a slower variant that wins KL only inside the noise.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 GGUF of the base from microsoft/FastContext-1.0-4B-SFT — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, a Q6_K/Q8 of the base is the one to grab.

Fast on its own. ~68 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0, measured llama-bench tg128). It's a 4B dense Qwen3 with no MTP head, so there's no speculative decoding — it doesn't need it, and at 4B it's a cheap explorer you can run several of in parallel.

NOTE // imatrix. This build is quantized with an importance matrix (Kalomaze groups_merged + froggeric code/technical, via froggeric/imatrix), computed on this model's BF16. We did not run a separate imatrix-vs-no-imatrix ablation on this 4B; at 4+ bpw imatrix is a free polish, not a transformation. Scope note: any fidelity-vs-BF16 figures are a held-out measurement, not an absolute coding benchmark.

05 · BUILD (REPRODUCIBLE)

# 0) convert the safetensors -> BF16 GGUF (plain qwen3 dense; no MTP, tied embeddings)
python convert_hf_to_gguf.py FastContext-1.0-4B-SFT/ --outtype bf16 --outfile FastContext-1.0-4B-SFT-BF16.gguf

# 1) imatrix on the BF16 (general+code: Kalomaze groups_merged + froggeric code/technical)
llama-imatrix -m FastContext-1.0-4B-SFT-BF16.gguf -f general+code-calib.txt -o fastcontext-4b.imatrix -c 512 -ngl 999

# 2) THE ONE BUILD: fast single-scale STRIX body + f16 tied emb/head + imatrix (the ★ file) — the balance point (§04).
#    tie_word_embeddings=True -> --token-embedding-type f16 also gives an f16 output head; no --output-tensor-type.
llama-quantize --token-embedding-type f16 --imatrix fastcontext-4b.imatrix \
  FastContext-1.0-4B-SFT-BF16.gguf  FastContext-1.0-4B-SFT-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

BASE MODEL	microsoft/FastContext-1.0-4B-SFT (MIT, Microsoft) · repository-exploration subagent · Qwen3 dense 4B (`Qwen3ForCausalLM`)
CALIBRATION	Kalomaze `groups_merged` + froggeric `code`/`technical` via froggeric/imatrix
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: -

GGUF

Model size

4B params

Architecture

qwen3

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/FastContext-1.0-4B-SFT-ROCmFP4-GGUF

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

microsoft/FastContext-1.0-4B-SFT

Quantized

(17)

this model

FORMAT ROCmFP4 4-BIT	PRECISION ~4.5 BPW	ARCH QWEN3 DENSE	CONTEXT 256 K
PARAMS 4B DENSE	DRAFT NO MTP	BACKEND VULKAN0	LICENSE MIT