Instructions for using microsoft/FastContext-1.0-4B-SFT with libraries and local apps.

Libraries

How to use microsoft/FastContext-1.0-4B-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/FastContext-1.0-4B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/FastContext-1.0-4B-SFT")
model = AutoModelForMultimodalLM.from_pretrained("microsoft/FastContext-1.0-4B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use microsoft/FastContext-1.0-4B-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/FastContext-1.0-4B-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
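
The same chat-completions call can be made from Python. This is a minimal sketch using only the standard library; the URL and model name are the ones from the curl command above, while the helper names `build_chat_request` and `send` and the `temperature` default are illustrative assumptions.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, temperature=0.0):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": "microsoft/FastContext-1.0-4B-SFT",
        "messages": messages,
        "temperature": temperature,  # assumption: greedy decoding by default
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires the vLLM server from the previous step to be running.
    req = build_chat_request(
        "http://localhost:8000",
        [{"role": "user", "content": "What is the capital of France?"}],
    )
    print(send(req))
```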

Use Docker

docker model run hf.co/microsoft/FastContext-1.0-4B-SFT

SGLang

How to use microsoft/FastContext-1.0-4B-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/FastContext-1.0-4B-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/FastContext-1.0-4B-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/FastContext-1.0-4B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/FastContext-1.0-4B-SFT with Docker Model Runner:
```
docker model run hf.co/microsoft/FastContext-1.0-4B-SFT
```

FastContext-1.0-4B-SFT in production agent integration

by ghostwithahat - opened 3 days ago

Discussion

ghostwithahat (3 days ago, edited 3 days ago)

Great idea! Thank you!

I integrated FastContext-1.0-4B-SFT as an explore_repository subagent in a production Go-based coding agent (ahle), served unquantized via llama.cpp on an RTX 3090 (104K context, 80 GPU layers, temperature 0.0). Findings after ~30 real-world runs:

What works:

Tool selection is good — the model prefers grep first, then targeted reads
With a directory listing in the system prompt (as in system.md), path hallucination drops to near zero
The model finds the right files roughly 60% of the time
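
To illustrate the directory-listing point, here is a minimal sketch of how a listing could be embedded in the system prompt. The prompt wording, the function names, and the `max_entries` cap are hypothetical; the contents of the actual system.md are not shown in this thread.

```python
import os


def build_directory_listing(root: str, max_entries: int = 200) -> str:
    """Return a sorted, newline-separated list of file paths relative to root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories such as .git so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    paths.sort()
    # Cap the listing so a large repository does not exhaust the context window.
    return "\n".join(paths[:max_entries])


def build_system_prompt(root: str) -> str:
    """Assemble a hypothetical system prompt that pins the model to real paths."""
    listing = build_directory_listing(root)
    return (
        "You are a repository exploration subagent.\n"
        "Only cite paths that appear in this listing:\n"
        f"{listing}\n"
    )
```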

What doesn't:

<final_answer> tags are inconsistent — the model often writes correct citation text but omits the XML wrapper. I had to add a regex fallback to extract bare /path/file.go:42-58 (reason) lines.
Line ranges are too broad (file.go:1-500) even when the answer spans 20 lines. The main agent (DeepSeek v4) re-reads cited files manually because it cannot trust coarse ranges. Net token savings: ~0%.
A "last turn" reminder system message helped, but only partially.
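
The regex fallback mentioned above could look roughly like this. It is a sketch in Python rather than the Go code used in the agent, and the exact pattern, the `Citation` type, and the optional bullet prefix are assumptions; only the `/path/file.go:42-58 (reason)` line shape and the `<final_answer>` tag come from the post.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    path: str
    start: int
    end: int
    reason: str


# Matches lines such as "/path/file.go:42-58 (reason)".
# The "-58" part and the "(reason)" part are both optional.
CITATION_RE = re.compile(
    r"^[ \t]*(?:[-*][ \t]+)?(?P<path>/\S+?):(?P<start>\d+)(?:-(?P<end>\d+))?"
    r"[ \t]*(?:\((?P<reason>.*)\))?[ \t]*$",
    re.MULTILINE,
)
FINAL_ANSWER_RE = re.compile(r"<final_answer>(.*?)</final_answer>", re.DOTALL)


def extract_citations(output: str) -> list[Citation]:
    """Parse citations, preferring the XML wrapper and falling back to bare lines."""
    wrapped = FINAL_ANSWER_RE.search(output)
    text = wrapped.group(1) if wrapped else output
    citations = []
    for match in CITATION_RE.finditer(text):
        start = int(match.group("start"))
        end = int(match.group("end") or start)  # single-line citation
        citations.append(
            Citation(match.group("path"), start, end, match.group("reason") or "")
        )
    return citations
```

Scanning the whole output only when the wrapper is missing keeps earlier turns, which may mention paths in passing, from being mistaken for final citations.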

I'm testing the 4B-RL variant next, hoping the format penalties and line-level F1 reward produce tighter citations. Happy to share comparison results.
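
For the comparison, a line-level F1 score between cited and expected ranges could be computed as sketched below. This is an assumption about what "line-level F1" means (set overlap of `(path, line)` pairs); the actual reward used to train the RL variant is not described in this thread.

```python
def expand(ranges):
    """Turn (path, start, end) ranges into a set of (path, line) pairs, end inclusive."""
    lines = set()
    for path, start, end in ranges:
        lines.update((path, n) for n in range(start, end + 1))
    return lines


def line_f1(predicted, gold):
    """Harmonic mean of line-level precision and recall; 0.0 when nothing overlaps."""
    pred, ref = expand(predicted), expand(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A coarse citation covers the answer (recall 1.0) but has precision 20/500.
coarse = line_f1([("file.go", 1, 500)], [("file.go", 42, 61)])
# A tight citation matching the same 20 lines scores 1.0.
tight = line_f1([("file.go", 42, 61)], [("file.go", 42, 61)])
```

Under this definition a `file.go:1-500` citation for a 20-line answer scores below 0.1, which matches the observation that coarse ranges give the main agent little to work with.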

maoquan-ms (Microsoft org, 1 day ago)

Thanks for the feedback.
Since this basic version of the model (4B) was only fine-tuned (SFT) on a dataset of 3k trajectories, hallucinated file paths are unfortunately expected. We will release a much more robust version soon.
If you can share your specific bad cases (repos) here, it would really help us improve.
