FastContext-1.0-4B-SFT-oQ6

An oQ6 quantized version of FastContext-1.0-4B-SFT optimized for Apple Silicon using oMLX.

This model preserves the repository exploration capabilities of FastContext while significantly reducing memory usage and improving inference efficiency through mixed-precision oQ quantization.

About FastContext

FastContext is a lightweight repository-exploration subagent designed for coding agents. Instead of having a single model perform both repository exploration and problem solving, FastContext specializes in repository discovery and evidence gathering using parallel tool calls.

The model explores repositories through:

  • READ
  • GLOB
  • GREP

and returns concise file paths and line references for downstream coding agents.

Original model: FastContext-1.0-4B-SFT.

Quantization

This release uses:

  • Quantization: oQ6
  • Format: MLX
  • Target Platform: Apple Silicon
  • Mixed Precision: Enabled
  • Optimized for local inference

The oQ quantization pipeline allocates higher precision to more sensitive weights while aggressively compressing less important regions of the network, providing a strong quality-to-size ratio.

Recommended Inference Settings

For best performance:

temperature: 0.7
top_p: 0.6
top_k: 20
min_p: 0
repetition_penalty: 1.05
presence_penalty: 1.5
thinking: true

oMLX Preset

temp: 0.7
top_p: 0.6
top_k: 20
min_p: 0
rep_penalty: 1.05
presence_penalty: 1.5
enable_thinking: true

These settings were selected to improve repository exploration quality, encourage broader search behavior, and maintain stable citation generation.

Example Usage

from mlx_lm import load, generate

model, tokenizer = load("FastContext-1.0-4B-SFT-oQ6")

prompt = "Find where authentication tokens are validated."

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    temp=0.7,
    top_p=0.6,
    top_k=20,
)

print(response)

Intended Use

This model is intended for:

  • Repository exploration
  • Codebase navigation
  • SWE-bench style workflows
  • Coding agents
  • Retrieval and evidence gathering
  • Search-heavy software engineering tasks

It is not intended to replace a primary coding model. FastContext works best as a specialized exploration subagent paired with a stronger reasoning or code-generation model.

Performance

FastContext was trained specifically to improve repository exploration efficiency and reduce the token overhead associated with repository search. The original paper reports improved end-to-end coding-agent performance while reducing token consumption across multiple SWE benchmarks.

Recommended Deployment

Apple Silicon:

  • M1 Pro / Max
  • M2 Pro / Max / Ultra
  • M3 Series
  • M4 Series

Works well with:

  • MLX
  • oMLX
  • Open WebUI
  • LM Studio (MLX builds)
  • Custom agent frameworks

Credits

  • Microsoft FastContext Team
  • Qwen Team
  • Apple MLX
  • oMLX

Citation

Please cite the original FastContext paper when using this model in research:

@misc{zhang2026fastcontexttrainingefficientrepository,
      title={FastContext: Training Efficient Repository Explorer for Coding Agents},
      author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
      year={2026},
      eprint={2606.14066},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
Downloads last month
-
Safetensors
Model size
0.9B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for yugeshkarunamurthy/FastContext-1.0-4B-oQ6