Instructions for using notSnix/Step-3.7-Flash-Q4_K_M-GGUF with libraries and local apps.

Libraries

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="notSnix/Step-3.7-Flash-Q4_K_M-GGUF",
	filename="Step-3.7-Flash-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
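
The call above returns an OpenAI-style completion dictionary rather than printing anything. A minimal sketch of pulling the reply text out of that dictionary follows; the helper name `reply_text` and the hand-written `sample` dictionary are illustrative assumptions, not part of llama-cpp-python.

```python
def reply_text(response: dict) -> str:
    """Return the assistant text from a chat completion dict, or "" if absent."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Hand-written stand-in for the value returned by llm.create_chat_completion(...).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}

print(reply_text(sample))
```

With a loaded model, the same helper would be used as `reply_text(llm.create_chat_completion(messages=[...]))`.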

Local Apps

llama.cpp

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Use Docker

docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

vLLM

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "notSnix/Step-3.7-Flash-Q4_K_M-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "notSnix/Step-3.7-Flash-Q4_K_M-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Ollama
How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Ollama:
```
ollama run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
```

Unsloth Studio

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-GGUF to start chatting

Pi

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
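
If `models.json` already lists other providers, pasting the block above by hand risks overwriting them. The sketch below merges the same provider entry into an existing file instead. The function name `add_llama_cpp_provider` is hypothetical, and the script assumes the file is plain JSON with a top-level `providers` object, as in the snippet above.

```python
import json
from pathlib import Path

# Provider entry copied from the configuration shown above.
PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M"}],
}


def add_llama_cpp_provider(config_path: Path) -> dict:
    """Insert or replace the "llama-cpp" provider, keeping any other providers."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("providers", {})["llama-cpp"] = PROVIDER
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

To apply it to the path named above, call `add_llama_cpp_provider(Path.home() / ".pi" / "agent" / "models.json")`.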

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Docker Model Runner:
```
docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M
```

Lemonade

How to use notSnix/Step-3.7-Flash-Q4_K_M-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull notSnix/Step-3.7-Flash-Q4_K_M-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Step-3.7-Flash-Q4_K_M-GGUF-Q4_K_M

List all available models

lemonade list

Step 3.7 Flash Q4_K_M GGUF

This repo contains the full text-side GGUF quantization of stepfun-ai/Step-3.7-Flash.

For speculative decoding, use the companion MTP draft GGUFs here:

notSnix/Step-3.7-Flash-MTP-Draft-GGUF

The source model is Apache-2.0. The original model is multimodal, but this GGUF artifact was prepared and tested for text-side llama.cpp serving.

Files

| File | Size | SHA256 | Purpose |
| --- | --- | --- | --- |
| `Step-3.7-Flash-Q4_K_M.gguf` | 111 GB | `4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f` | Full model |
| `chat_template.jinja` | 5.6 KB | `f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633` | Chat template |

Runtime

As of commit d545a2a993849fcf3b752d85ae256fc9d6a9de79, llama.cpp main loads Step MTP drafts natively when paired with the companion draft repo. A clean build of that commit is what this release was smoke-tested against.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

Basic Command

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-file chat_template.jinja
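
Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and targets port 8000 to match the command above. It assumes that with `--reasoning-format deepseek` the server returns the model's thinking in a separate `reasoning_content` field of the message; the function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the command above


def build_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if the field is missing."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt: str) -> tuple[str, str]:
    """Send one prompt to the running server and split the reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_reply(json.load(resp))
```

For example, `reasoning, answer = ask("What is the capital of France?")` needs the server to be up; `build_request` and `split_reply` can be exercised without it.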

With MTP Draft

Download an MTP draft GGUF from notSnix/Step-3.7-Flash-MTP-Draft-GGUF, then run:

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.60 \
  --chat-template-file chat_template.jinja

Local Benchmark Snapshot

GPUs: RTX PRO 6000, 3x RTX 3090.

Recommended local MTP setting from the tested sweep: `--spec-draft-n-max 2 --spec-draft-p-min 0.60` with the Q8_0 draft.

| Run | Prompt tokens | Prefill | Decode | TTFT | Notes |
| --- | --- | --- | --- | --- | --- |
| Q4_K_M + Q8_0 MTP `n_max=2 p_min=0.60` | 32,769 | 1823.47 tok/s | 104.38 tok/s | 18.054 s | 87.1% draft accepted |
| Q4_K_M + BF16 MTP `n_max=2 p_min=0.60` | 32,769 | 1835.66 tok/s | 93.38 tok/s | 17.904 s | 79.3% draft accepted |
| Q4_K_M + BF16 MTP `n_max=2 p_min=0.60` | 65,537 | 1626.84 tok/s | 94.79 tok/s | 40.391 s | 81.2% draft accepted |
| Q4_K_M + MTP `n_max=3` | 604 | - | 143.81 tok/s | 0.415 s | 172/181 draft accepted, 95.0% |
| Q4_K_M + MTP `n_max=3` | 32,519 | 2097.79 tok/s | 104.91 tok/s | 15.62 s | 60/73 draft accepted, 82.2% |
| Q4_K_M + MTP `n_max=3` | 54,619 | 1909.23 tok/s | 106.73 tok/s | 28.82 s | 60/70 draft accepted, 85.7% |
| Q4_K_S baseline | 604 | 1738.12 tok/s | 110.70 tok/s | 0.352 s | no MTP |
| Q4_K_S baseline | 54,619 | 2194.42 tok/s | 89.15 tok/s | 25.16 s | no MTP |
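
The figures in the table can be cross-checked with simple arithmetic. The sketch below assumes that time to first token is roughly prompt tokens divided by prefill rate (an assumption, not something the benchmark states) and recomputes the draft acceptance percentages from the raw counts. All numbers are copied from the table; the function names are illustrative.

```python
# (run label, prompt tokens, prefill tok/s, reported TTFT in s), copied from the table.
# The 604-token MTP row is omitted because it reports no prefill rate.
ROWS = [
    ("Q4_K_M + Q8_0 MTP n_max=2", 32769, 1823.47, 18.054),
    ("Q4_K_M + BF16 MTP n_max=2", 32769, 1835.66, 17.904),
    ("Q4_K_M + BF16 MTP n_max=2", 65537, 1626.84, 40.391),
    ("Q4_K_M + MTP n_max=3", 32519, 2097.79, 15.62),
    ("Q4_K_M + MTP n_max=3", 54619, 1909.23, 28.82),
    ("Q4_K_S baseline", 604, 1738.12, 0.352),
    ("Q4_K_S baseline", 54619, 2194.42, 25.16),
]


def estimated_ttft(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Estimate TTFT in seconds as prompt length over prefill throughput."""
    return prompt_tokens / prefill_tok_s


def acceptance_pct(accepted: int, drafted: int) -> float:
    """Draft acceptance rate as a percentage, rounded to one decimal."""
    return round(100 * accepted / drafted, 1)


for label, tokens, prefill, reported in ROWS:
    estimate = estimated_ttft(tokens, prefill)
    print(f"{label:28s} {tokens:6d} tok  est {estimate:7.3f} s  reported {reported:7.3f} s")

for accepted, drafted in [(172, 181), (60, 73), (60, 70)]:
    print(f"{accepted}/{drafted} drafts accepted = {acceptance_pct(accepted, drafted)}%")
```

Under that assumption each estimate lands within a few tenths of a second of the reported TTFT. Note that the baseline rows use the Q4_K_S quant rather than Q4_K_M, so decode-speed comparisons against them mix the effect of MTP with the effect of the quantization.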

Limited task checks:

| Check | Q4_K_S baseline | Q4_K_M + MTP `n_max=3` |
| --- | --- | --- |
| ARC Challenge chat, 10 samples | 0.9 | 0.9 |
| GSM8K strict/flexible, 10 samples | 0.9 / 0.9 | 0.8 / 0.8 |
| Code needle / NIAH reasoning-aware | 12/12 | 12/12 |

Checksums

sha256sum -c SHA256SUMS
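
Where `sha256sum` is not available, the same check can be done in Python against the hashes listed under Files. This is a minimal sketch: the function names are illustrative, and it assumes both files sit in the same directory under their original names.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the Files table.
EXPECTED = {
    "Step-3.7-Flash-Q4_K_M.gguf": "4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f",
    "chat_template.jinja": "f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a 111 GB model never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Path) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = directory / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Calling `verify(Path("."))` in the download directory returns a name-to-boolean mapping; a `False` entry means the file is missing or its digest differs.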

Notes

The base model advertises 256k context; this GGUF release was loaded locally at 256k context.
The MTP draft GGUFs are companion files for speculative decoding and are hosted separately to avoid confusing them with full-model quants.
This is a community GGUF quantization/repackaging of the upstream Apache-2.0 model, not an official StepFun release.

GGUF

Model size: 197B params
Architecture: step35
Hardware compatibility: 4-bit

Model tree for notSnix/Step-3.7-Flash-Q4_K_M-GGUF

Base model: stepfun-ai/Step-3.7-Flash
