[Banner: a repo tree (src/auth.rs, src/db.rs, lib/core.rs, tests/) is handed to a "4B scout" that runs READ·GLOB·GREP and returns auth.rs:41-77 and core.rs:102-130. 256K ctx, only what you need.]
| spec | value |
|---|---|
| FORMAT | GGUF · IQ3_M / Q4_K_M |
| SIZE | 1.96 / 2.50 GB |
| ARCH | QWEN3 DENSE · 36L |
| CONTEXT | 256K NATIVE |
| IMATRIX | LONG-CONTEXT CALIB |
| RETRIEVAL @5K | 30/30 = BF16 |
| RUNS ON | METAL·CUDA·CPU·VULKAN |
| LICENSE | MIT |
Microsoft open-sourced FastContext-1.0-4B-SFT, then deleted it from Hugging Face and GitHub (verified: 404 on both). These are long-context-imatrix GGUF quants, so the weights stay in your hands; the full preserved original (bf16, 8.0 GB) is at KikoCis/FastContext-1.0-4B-SFT. Own your AI.
🔍 What FastContext is (and why it's special)
FastContext isn't a chatbot — it's a repository-exploration subagent for coding agents. Your main agent (Claude Code, Copilot, Cursor, OpenHands…) delegates file discovery to it:
- Main agent asks: "where is auth handled?"
- FastContext fires parallel read-only tool calls (READ/GLOB/GREP) across the repo,
- and returns just the file paths + line ranges you need as compact, focused context.
Your expensive frontier agent stops burning tokens crawling directories. Microsoft's (now-deleted) announcement reported ~60% fewer tokens from the main agent and +5.5% on SWE-bench — their figures, not independently reproduced here.
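To make the read-only tool surface concrete, here is a minimal sketch of what a host could expose as READ/GLOB/GREP. The function names, signatures, and return shapes are hypothetical: this card does not document FastContext's actual tool schema, so treat this as an illustration of the idea rather than the model's real interface.

```python
import re
from pathlib import Path


def glob_tool(root, pattern):
    """GLOB: repo-relative paths of files matching a glob pattern, sorted."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


def grep_tool(root, regex, pattern="**/*"):
    """GREP: (path, line_number, line) for every line that matches regex."""
    rx = re.compile(regex)
    hits = []
    for rel in glob_tool(root, pattern):
        try:
            text = (Path(root) / rel).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((rel, number, line.strip()))
    return hits


def read_tool(root, path, start, end):
    """READ: lines start..end of a file (1-indexed, inclusive)."""
    lines = (Path(root) / path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])
```

None of the three functions writes to disk, which is the property that makes it safe to run them in parallel. A real host would also want to reject paths that escape `root`.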
📦 Which file should I pick?
| file | bits | size | vs original | pick this if… |
|---|---|---|---|---|
| fastcontext4b.IQ3_M.imx.gguf | ~3.3 | 1.96 GB | 4.1× smaller | tightest RAM: smallest FastContext GGUF anywhere, retrieval-validated |
| fastcontext4b.Q4_K_M.imx.gguf | ~4.5 | 2.50 GB | 3.2× smaller | the safe default: more headroom for long contexts |
K-quants (Q4_K_M) = solid general quants. I-quants (IQ3_M) = smaller at similar quality; they rely on an imatrix at quantization time (we ship ours: fastcontext4b.imatrix). You do not need the imatrix file to run the GGUFs.
What's different vs the other FastContext GGUFs: the importance matrix here is calibrated on long, multi-thousand-token sequences (LongAlign), not the usual short generic corpus — matching the 256K regime this model was built for. For AMD Strix Halo specifically, see plunderstruck's ROCmFP4 build (different target, code-weighted imatrix).
🧮 Will it fit? (RAM/VRAM cheat-sheet)
Total ≈ weights + KV-cache (KV grows with context):
| you have | quant | context you can run |
|---|---|---|
| 4 GB | IQ3_M | ~8–16K |
| 6 GB | IQ3_M / Q4_K_M | ~32K |
| 8 GB | Q4_K_M | ~64–128K |
| 12 GB+ | Q4_K_M | up to 256K native |
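The "weights + KV-cache" rule can be sketched as a small estimator. The 36 layers come from the spec table; the 8 KV heads and head dimension of 128 are assumptions borrowed from a Qwen3-4B-style config and are not stated in this card, so check them against the GGUF metadata before relying on the output. The result depends heavily on the cache element size, so compare it with the table using the KV cache type you actually run.

```python
# Rough memory estimate: total ≈ weights + KV cache.
# ASSUMPTIONS (not stated in this card): 8 KV heads and head_dim 128, as in a
# Qwen3-4B-style config. The layer count (36) is taken from the spec table.

def kv_cache_bytes(n_ctx, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    """Bytes for keys plus values over n_ctx tokens (f16 cache by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return int(per_token * n_ctx)


def total_gb(weights_gb, n_ctx, **kv_kwargs):
    """Weights plus KV cache in decimal GB; ignores compute buffers and overhead."""
    return weights_gb + kv_cache_bytes(n_ctx, **kv_kwargs) / 1e9


if __name__ == "__main__":
    for n_ctx in (8192, 32768, 131072):
        f16 = total_gb(2.50, n_ctx)
        q8 = total_gb(2.50, n_ctx, bytes_per_elem=1.0)  # roughly an 8-bit cache
        print(f"ctx={n_ctx:>6}  f16 KV: {f16:6.2f} GB   ~8-bit KV: {q8:6.2f} GB")
```

Passing a smaller `bytes_per_elem` approximates a quantized KV cache, which is the main lever for fitting longer contexts into a fixed memory budget.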
🚀 How to run it
```bash
# llama.cpp: point it at your repo dump and ask for locations
llama-cli -m fastcontext4b.Q4_K_M.imx.gguf -c 32768 \
  -p "…repo contents…\n\nWhere is authentication handled? Return file:line ranges only."

# llama-server: use it as a subagent endpoint for your main coding agent
llama-server -m fastcontext4b.Q4_K_M.imx.gguf -c 65536 --port 8091

# Ollama (Modelfile included, 32K default)
ollama create fastcontext -f Modelfile && ollama run fastcontext
```
Recommended sampling: temperature 0.6, top_p 0.9, top_k 20. For pure retrieval calls, temperature 0 works well.

Subagent pattern: keep FastContext resident on a cheap local endpoint; have your main agent call it for "where is X?" queries and inject only the returned ranges into its own context.
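The last step of the subagent pattern, asking the local endpoint a question and keeping only the returned ranges, might look like the sketch below. It assumes llama-server is listening on port 8091 as in the command above and that answers contain `path:start-end` ranges like those in the banner. The model name, helper names, and regex are illustrative choices, not part of the upstream project, and the tool-calling loop itself is left out.

```python
import json
import re
import urllib.request

# Port taken from the llama-server example in this card; adjust to your setup.
ENDPOINT = "http://localhost:8091/v1/chat/completions"

# Matches ranges such as "src/auth.rs:41-77" (assumed output format).
RANGE_RE = re.compile(r"([\w./-]+\.\w+):(\d+)-(\d+)")


def build_payload(question, temperature=0.0):
    """OpenAI-style chat request; temperature 0 for pure retrieval calls."""
    return {
        "model": "fastcontext",  # placeholder name, an assumption
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
    }


def parse_ranges(text):
    """Extract (path, start_line, end_line) tuples from the model's answer."""
    return [(path, int(start), int(end)) for path, start, end in RANGE_RE.findall(text)]


def ask_fastcontext(question, endpoint=ENDPOINT, timeout=120):
    """POST the question to the local server and return only the parsed ranges."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return parse_ranges(body["choices"][0]["message"]["content"])
```

The main agent would then read just those line ranges from disk and add them to its own context, instead of the full files.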
📊 Validation — measured on these files (honest)
Needle-in-haystack retrieval (find an inserted fact inside real long documents), greedy decoding:
| model | needle retrieval @~5K ctx |
|---|---|
| original (bf16) | 30/30 |
| Q4_K_M (imx) | 30/30 |
| IQ3_M (imx) | 30/30 |
At ~5K context all three, including the aggressive IQ3_M, match the original bf16 score (30/30): no retrieval loss was measured on this 30-task set. This says nothing yet about longer contexts; deeper long-context numbers will be added once measured on a clean harness. No placeholder claims.
- Harness: llama-server + OpenAI-compat API, temp 0, 30 tasks, haystacks built from real LongAlign documents, deterministic gold.
- Date: 2026-07-02.
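The two harness pieces that do not need a server, building a haystack and scoring against deterministic gold, can be sketched as follows. The card does not say how needles were placed or how answers were matched, so the paragraph-boundary insertion and the substring check are assumptions, and the function names are hypothetical.

```python
def insert_needle(document, needle, depth):
    """Insert needle at a fractional depth (0.0 = start, 1.0 = end),
    always on a paragraph boundary so the surrounding text stays intact."""
    paragraphs = document.split("\n\n")
    index = round(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:index] + [needle] + paragraphs[index:])


def is_hit(answer, gold):
    """Deterministic scoring: the gold string must appear in the answer."""
    return gold.lower() in answer.lower()


def score(answers, golds):
    """Return (hits, total) over paired answers and gold strings."""
    hits = sum(is_hit(answer, gold) for answer, gold in zip(answers, golds))
    return hits, len(golds)
```

With greedy decoding (temp 0) and a fixed needle position per task, repeated runs should give the same hit count, which is what makes a figure like 30/30 comparable across quants.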
⚠️ Good to know
- Strengths: repo exploration, long-document retrieval, read-only tool calling (READ/GLOB/GREP), returning compact file:line evidence.
- It's a scout, not a solver — pair it with your main coding agent; don't expect it to write the patch itself.
- The original repo is gone, so upstream docs/issues are gone with it; the harness conventions above are from the model's own announcement and community usage.
🗒️ Changelog
- 2026-07-02 v1 — IQ3_M + Q4_K_M with long-context imatrix; retrieval validated @5K (30/30 all); imatrix + Modelfile included; original preserved in the sibling repo.
📚 Credit & license
Model, weights, training: © Microsoft — FastContext-1.0-4B-SFT (MIT), sourced via the ShaunGves re-upload after the original was removed. Quantization + long-context imatrix + validation: KikoCis. MIT (same as upstream). No weights modified — faithful quantization only.