Instructions for using microsoft/FastContext-1.0-4B-SFT with libraries, inference servers, and local apps.

Libraries

How to use microsoft/FastContext-1.0-4B-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/FastContext-1.0-4B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load the model directly (the Qwen3-4B backbone is a text-only causal LM)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/FastContext-1.0-4B-SFT")
model = AutoModelForCausalLM.from_pretrained("microsoft/FastContext-1.0-4B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use microsoft/FastContext-1.0-4B-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/FastContext-1.0-4B-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
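
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "microsoft/FastContext-1.0-4B-SFT"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the prompt and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` mirrors the curl command; pass `base_url="http://localhost:30000"` to target an SGLang server instead.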

Use Docker

docker model run hf.co/microsoft/FastContext-1.0-4B-SFT

SGLang

How to use microsoft/FastContext-1.0-4B-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/FastContext-1.0-4B-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/FastContext-1.0-4B-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/FastContext-1.0-4B-SFT with Docker Model Runner:
```
docker model run hf.co/microsoft/FastContext-1.0-4B-SFT
```

1. Model Introduction

FastContext-1.0 is a lightweight repository-exploration subagent for LLM coding agents. Instead of letting a single model both explore the repository and solve the task, FastContext separates these two roles: it is invoked on demand by a main coding agent, issues parallel read-only tool calls (READ, GLOB, GREP), and returns compact file paths and line ranges as focused context.

Repository exploration is a major bottleneck in modern coding agents — locating relevant code consumes a large share of the token budget and pollutes the solver's context with irrelevant snippets. In our analysis of GPT-5.4 trajectories, reading and searching account for 56.2% of all tool-use turns and 46.5% of the main agent's total tokens. FastContext moves this work into a dedicated subagent so the main agent receives clean, grounded evidence rather than the long trail of exploratory reads and searches.

The model family spans 4B to 30B parameters. Each model is bootstrapped from strong reference-model trajectories via supervised fine-tuning (SFT), then refined with task-grounded reinforcement learning (RL) for broad first-turn search, multi-turn evidence gathering, and precise citation generation.

Backbones: Qwen3-4B-Instruct (4B explorer) and Qwen3-Coder-30B-A3B (30B explorer)
Variants: FC-4B-SFT, FC-4B-RL (deployment targets), FC-30B-SFT (scaling reference)
Context length: up to 262K tokens
Paper: FastContext: Training Efficient Repository Explorer for Coding Agents
Code & data: https://github.com/microsoft/fastcontext

How it works

Coding Agent ──query──▶  FastContext  ──read/search──▶  Repository
     ▲                       │
     └──── file-line ────────┘
          citations

Internally, FastContext runs an exploration loop:

Query understanding — translate the issue into search intents.
Parallel tool calling — issue multiple READ / GLOB / GREP calls in a single turn to cover complementary hypotheses.
Observation-driven refinement — use tool outputs to guide the next search turn.
Final citations — return a compact <final_answer> block of file paths and line ranges.
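
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `explore`, `model_step`, the `max_turns` cap, and the shape of the step dictionary are all hypothetical, and the model is represented by a plain callable.

```python
import re
from concurrent.futures import ThreadPoolExecutor

FINAL_RE = re.compile(r"<final_answer>(.*?)</final_answer>", re.DOTALL)


def explore(query, model_step, tools, max_turns=8):
    """Run the explore loop until <final_answer> appears or turns run out.

    model_step(history) -> {"text": str, "tool_calls": [(tool_name, kwargs), ...]}
    tools: dict mapping "READ" / "GLOB" / "GREP" to read-only callables.
    (Both interfaces are assumptions made for this sketch.)
    """
    history = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        step = model_step(history)
        history.append({"role": "assistant", "content": step["text"]})
        match = FINAL_RE.search(step["text"])
        if match:
            # Final citations: return the body of the <final_answer> block.
            return match.group(1).strip()
        calls = step["tool_calls"]
        if not calls:
            break
        # The tools are read-only, so all calls from one turn can run concurrently.
        with ThreadPoolExecutor(max_workers=len(calls)) as pool:
            futures = [pool.submit(tools[name], **kwargs) for name, kwargs in calls]
            observations = [f.result() for f in futures]
        # Feed every observation back so the next turn can refine the search.
        for (name, _), obs in zip(calls, observations):
            history.append({"role": "tool", "name": name, "content": obs})
    return None
```

Because `model_step` is just a callable, the loop can be exercised with a scripted stand-in before wiring it to a served model.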

2. Evaluation Results

End-to-end performance (Mini-SWE-Agent)

Integrating FastContext into Mini-SWE-Agent improves end-to-end scores by up to 5.0 points while reducing main-agent token consumption by up to 50.7%, with only marginal overhead. Scores and tokens are measured on the main-agent trajectory; deltas are relative to w/o Explore for the same main agent.

| Main Agent | Subagent | SWE-bench Multilingual | SWE-bench Pro | SWE-QA |
|---|---|---|---|---|
| GPT-5.4 | w/o Explore | 71.7 / 457k | 46.0 / 818k | 81.3 / 418k |
| GPT-5.4 | FC-30B-SFT | 75.0 (↑3.3) / 356k (↓22.1%) | 49.0 (↑3.0) / 688k (↓15.9%) | 82.0 (↑0.7) / 206k (↓50.7%) |
| GPT-5.4 | FC-4B-SFT | 73.3 (↑1.6) / 364k (↓20.4%) | 47.0 (↑1.0) / 689k (↓15.8%) | 81.9 (↑0.6) / 213k (↓49.0%) |
| GPT-5.4 | FC-4B-RL | 74.7 (↑3.0) / 338k (↓26.0%) | 48.5 (↑2.5) / 701k (↓14.3%) | 82.0 (↑0.7) / 210k (↓49.8%) |
| GLM-5.1 | w/o Explore | 72.3 / 2514k | 17.5 / 2692k | 72.7 / 401k |
| GLM-5.1 | FC-30B-SFT | 73.7 (↑1.4) / 1797k (↓28.5%) | 20.0 (↑2.5) / 2370k (↓12.0%) | 73.3 (↑0.6) / 292k (↓27.2%) |
| GLM-5.1 | FC-4B-SFT | 73.3 (↑1.0) / 1919k (↓23.7%) | 18.0 (↑0.5) / 2279k (↓15.3%) | 73.4 (↑0.7) / 306k (↓23.7%) |
| GLM-5.1 | FC-4B-RL | 73.7 (↑1.4) / 1971k (↓21.6%) | 22.5 (↑5.0) / 2210k (↓17.9%) | 73.5 (↑0.8) / 302k (↓24.7%) |
| Kimi-K2.6 | w/o Explore | 76.3 / 1553k | 31.0 / 2383k | 71.6 / 510k |
| Kimi-K2.6 | FC-30B-SFT | 76.7 (↑0.4) / 1360k (↓12.4%) | 33.0 (↑2.0) / 2150k (↓9.8%) | 72.8 (↑1.2) / 373k (↓26.9%) |
| Kimi-K2.6 | FC-4B-SFT | 75.3 (↓1.0) / 1306k (↓15.9%) | 32.5 (↑1.5) / 2159k (↓9.4%) | 72.6 (↑1.0) / 402k (↓21.2%) |
| Kimi-K2.6 | FC-4B-RL | 78.3 (↑2.0) / 1384k (↓10.9%) | 33.5 (↑2.5) / 2158k (↓9.4%) | 72.6 (↑1.0) / 378k (↓25.9%) |

Each cell shows score / main-agent tokens; the values in parentheses are changes relative to the w/o Explore row of the same main agent.
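
The deltas in the table can be reproduced with simple arithmetic. The sketch below assumes score deltas are absolute point differences and token deltas are relative reductions against the w/o Explore baseline, which matches the table's values. The function names are illustrative.

```python
def score_delta(baseline, with_explorer):
    """Absolute change in score, in points."""
    return round(with_explorer - baseline, 1)


def token_reduction_pct(baseline_tokens, explorer_tokens):
    """Relative reduction in main-agent tokens, as a percentage of the baseline."""
    return round(100.0 * (baseline_tokens - explorer_tokens) / baseline_tokens, 1)


# GPT-5.4 on SWE-bench Multilingual: w/o Explore vs. FC-30B-SFT (values from the table).
print(score_delta(71.7, 75.0))        # 3.3
print(token_reduction_pct(457, 356))  # 22.1
```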

Highlights:

FastContext improves end-to-end scores in 26 of the 27 configurations above (the exception is FC-4B-SFT with Kimi-K2.6 on SWE-bench Multilingual, ↓1.0); the largest gain appears on SWE-bench Pro (GLM-5.1 with FC-4B-RL, +5.0).
The biggest token savings reach 50.7% (GPT-5.4 with FC-30B-SFT on SWE-QA).
The compact 4B-RL explorer can outperform the larger 30B-SFT explorer: on GLM-5.1 SWE-bench Pro it reaches 22.5 vs. 20.0 while the main agent uses fewer tokens (2210k vs. 2370k).

3. Quick Start

Launch the model with an OpenAI-compatible server (e.g. SGLang). The example below serves the 4B explorer:

python3 -m sglang.launch_server \
    --model-path microsoft/FastContext-1.0-4B-SFT \
    --tool-call-parser qwen \
    --context-length 262144 \
    --trust-remote-code \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1 \
    --mem-fraction-static 0.8

FastContext exposes only three read-only tools to the model:

| Tool | Purpose |
|---|---|
| `READ` | Return line-numbered file contents |
| `GLOB` | Path discovery by glob pattern |
| `GREP` | Regex search over repository text (ripgrep-style) |
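
To make the tool semantics concrete, here is a minimal standard-library sketch of three read-only tools with the purposes listed above. The function names, argument names, and output formats are assumptions for illustration; the released tools may differ (for example, the real GREP is described as ripgrep-style, whereas this one uses Python's `re`).

```python
import re
from pathlib import Path


def read_tool(root, path, start=1, end=None):
    """READ: return lines start..end (1-indexed, inclusive) with line-number prefixes."""
    lines = (Path(root) / path).read_text(encoding="utf-8", errors="replace").splitlines()
    end = len(lines) if end is None else min(end, len(lines))
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))


def glob_tool(root, pattern):
    """GLOB: return sorted repository-relative paths of files matching a glob pattern."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file())


def grep_tool(root, pattern, glob="**/*"):
    """GREP: return 'path:line:text' for every line that matches a regex."""
    regex = re.compile(pattern)
    hits = []
    for rel in glob_tool(root, glob):
        text = (Path(root) / rel).read_text(encoding="utf-8", errors="replace")
        for n, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{rel}:{n}:{line}")
    return hits
```

None of these functions writes to disk, which is what makes it safe to run several of them in parallel within a single turn.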

At each turn the explorer either issues one or more (parallel) tool calls or stops with a final <final_answer> evidence list. Wire FastContext into a coding agent (e.g. Mini-SWE-Agent) as an exploration subagent the main agent can invoke on demand.
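
On the main-agent side, the explorer's output has to be turned into structured citations. The sketch below assumes each line inside `<final_answer>` has the form `path:start-end`; that line format, the `Citation` type, and `parse_final_answer` are assumptions for illustration, so adjust the pattern to the format the model actually emits.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    path: str
    start: int
    end: int


BLOCK_RE = re.compile(r"<final_answer>(.*?)</final_answer>", re.DOTALL)
# Assumed citation format: "<path>:<start>-<end>", one per line.
LINE_RE = re.compile(r"^(?P<path>\S+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_final_answer(text):
    """Return the citations in the <final_answer> block, or None if there is no block."""
    block = BLOCK_RE.search(text)
    if block is None:
        return None  # The explorer has not stopped yet.
    citations = []
    for raw in block.group(1).splitlines():
        m = LINE_RE.match(raw.strip())
        # Skip lines that do not parse or whose range is inverted.
        if m and int(m["start"]) <= int(m["end"]):
            citations.append(Citation(m["path"], int(m["start"]), int(m["end"])))
    return citations
```

A `None` result means the turn contained tool calls rather than a final answer; an empty list means the block was present but held no parseable citations.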

4. Training Recipe

FastContext is trained in two stages:

Supervised fine-tuning (SFT): The model is fine-tuned on exploration traces split into three sources that match the subagent's runtime behavior: parallel_toolcalls (broad first-turn search), multiturn_traj (multi-turn evidence gathering), and linerange (precise citation generation).
Reinforcement learning (RL): The model is rolled out as the actual subagent and optimized with GRPO using a deterministic reward combining file- and line-level F1, a bonus for bounded parallel exploration, and format penalties.
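
A reward of the kind described in the RL stage can be sketched as follows. This is an illustration under stated assumptions, not the training code: the weights, the bonus size, the `max_parallel` cap, and the exact bonus and penalty rules are hypothetical. `group_advantages` shows the group-relative normalization that GRPO applies to rewards from rollouts of the same prompt.

```python
from statistics import mean, pstdev


def f1(predicted, gold):
    """F1 between two sets; 0.0 when either set is empty or they do not overlap."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def expand_lines(citations):
    """Turn (path, start, end) citations into a set of (path, line) pairs."""
    return {(path, n) for path, start, end in citations for n in range(start, end + 1)}


def reward(pred, gold, calls_per_turn, well_formed,
           w_file=0.5, w_line=0.5, bonus=0.1, max_parallel=8, format_penalty=1.0):
    """File- and line-level F1, plus a bounded-parallelism bonus, minus a format penalty.

    All weights and thresholds are hypothetical defaults for this sketch.
    """
    if not well_formed:
        return -format_penalty
    file_f1 = f1({c[0] for c in pred}, {c[0] for c in gold})
    line_f1 = f1(expand_lines(pred), expand_lines(gold))
    score = w_file * file_f1 + w_line * line_f1
    # Bonus only if some turn used parallel calls and no turn exceeded the cap.
    if calls_per_turn and 1 < max(calls_per_turn) <= max_parallel:
        score += bonus
    return score


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the reward depends only on the cited files and lines, the tool-call counts, and a format check, it is deterministic for a given rollout and needs no learned judge.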

License

This project is licensed under the MIT License.
