banner

KIKOCIS // LONG-CONTEXT IMATRIX GGUF // PRESERVED & QUANTIZED
 repo/                    ┌─────────┐
 ├── src/         ═══════▶│ ◉ 4B    │
 │   ├── auth.rs  ◀═══════│ scout   │
 │   └── db.rs            └─────────┘
 ├── lib/                READ·GLOB·GREP
 │   └── core.rs  ──▶ auth.rs:41-77
 └── tests/       ──▶ core.rs:102-130
      256K ctx        only what you need
FASTCONTEXT-1.0-4B
256K REPO-EXPLORER · QWEN3 DENSE 4B · LONG-CTX IMATRIX · 8.0 GB → 1.96 GB
FORMAT
GGUF · IQ3_M / Q4_K_M
SIZE
1.96 / 2.50 GB
ARCH
QWEN3 DENSE · 36L
CONTEXT
256K NATIVE
IMATRIX
LONG-CONTEXT CALIB
RETRIEVAL @5K
30/30 = BF16
RUNS ON
METAL·CUDA·CPU·VULKAN
LICENSE
MIT

Microsoft open-sourced it, then deleted it from HuggingFace and GitHub (verified: 404 on both). These are long-context-imatrix GGUF quants so the weights stay in your hands — the full preserved original (bf16, 8.0 GB) is at KikoCis/FastContext-1.0-4B-SFT. Own your AI.

🔍 What FastContext is (and why it's special)

FastContext isn't a chatbot — it's a repository-exploration subagent for coding agents. Your main agent (Claude Code, Copilot, Cursor, OpenHands…) delegates file discovery to it:

  1. Main agent asks: "where is auth handled?"
  2. FastContext fires parallel read-only tool callsREAD / GLOB / GREP — across the repo,
  3. and returns just the file paths + line ranges you need as compact, focused context.

Your expensive frontier agent stops burning tokens crawling directories. Microsoft's (now-deleted) announcement reported ~60% fewer tokens from the main agent and +5.5% on SWE-benchtheir figures, not independently reproduced here.

📦 Which file should I pick?

file bits size vs original pick this if…
fastcontext4b.IQ3_M.imx.gguf ~3.3 1.96 GB 4.1× smaller tightest RAM — smallest FastContext GGUF anywhere, retrieval-validated
fastcontext4b.Q4_K_M.imx.gguf ~4.5 2.50 GB 3.2× smaller the safe default — more headroom for long contexts

K-quants (Q4_K_M) = solid general quants. I-quants (IQ3_M) = smaller at similar quality; they need an imatrix (we ship ours: fastcontext4b.imatrix).

What's different vs the other FastContext GGUFs: the importance matrix here is calibrated on long, multi-thousand-token sequences (LongAlign), not the usual short generic corpus — matching the 256K regime this model was built for. For AMD Strix Halo specifically, see plunderstruck's ROCmFP4 build (different target, code-weighted imatrix).

🧮 Will it fit? (RAM/VRAM cheat-sheet)

Total ≈ weights + KV-cache (KV grows with context):

you have quant context you can run
4 GB IQ3_M ~8–16K
6 GB IQ3_M / Q4_K_M ~32K
8 GB Q4_K_M ~64–128K
12 GB+ Q4_K_M up to 256K native

🚀 How to run it

# llama.cpp — point it at your repo dump, ask for locations:
llama-cli -m fastcontext4b.Q4_K_M.imx.gguf -c 32768 \
  -p "…repo contents…\n\nWhere is authentication handled? Return file:line ranges only."

# llama-server (use it as a subagent endpoint for your main coding agent):
llama-server -m fastcontext4b.Q4_K_M.imx.gguf -c 65536 --port 8091

# Ollama (Modelfile included, 32K default):
ollama create fastcontext -f Modelfile && ollama run fastcontext

Recommended sampling: temperature 0.6, top_p 0.9, top_k 20. For pure retrieval calls, temperature 0 works well. Subagent pattern: keep FastContext resident on a cheap local endpoint; have your main agent call it for "where is X?" queries and inject only the returned ranges into its own context.

📊 Validation — measured on these files (honest)

Needle-in-haystack retrieval (find an inserted fact inside real long documents), greedy decoding:

model needle retrieval @~5K ctx
original (bf16) 30/30
Q4_K_M (imx) 30/30
IQ3_M (imx) 30/30

At 5K context all three — including the aggressive IQ3_M — match the original bf16 perfectly: quantization is lossless for retrieval here. Deeper long-context numbers will be added once measured on a clean harness — no placeholder claims.

  • Harness: llama-server + OpenAI-compat API, temp 0, 30 tasks, haystacks built from real LongAlign documents, deterministic gold.
  • Date: 2026-07-02.

⚠️ Good to know

  • Strengths: repo exploration, long-document retrieval, read-only tool calling (READ/GLOB/GREP), returning compact file:line evidence.
  • It's a scout, not a solver — pair it with your main coding agent; don't expect it to write the patch itself.
  • The original repo is gone, so upstream docs/issues are gone with it; the harness conventions above are from the model's own announcement and community usage.

🗒️ Changelog

  • 2026-07-02 v1 — IQ3_M + Q4_K_M with long-context imatrix; retrieval validated @5K (30/30 all); imatrix + Modelfile included; original preserved in the sibling repo.

📚 Credit & license

Model, weights, training: © Microsoft — FastContext-1.0-4B-SFT (MIT), sourced via the ShaunGves re-upload after the original was removed. Quantization + long-context imatrix + validation: KikoCis. MIT (same as upstream). No weights modified — faithful quantization only.

Downloads last month
-
GGUF
Model size
4B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for KikoCis/FastContext-1.0-4B-longctx-imatrix-GGUF

Quantized
(26)
this model