Peripheral (Qwen3-8B)

Peripheral is a fine-tuned 8B knowledge-management agent (KMA) that manages filesystem-based knowledge bases. Given a query, it routes to the right files, decides whether a recent change is safe to serve, and classifies what that change actually is. It runs locally at zero cost per call. On the knowledge-management tasks it was built for, it scores higher overall than zero-shot Claude Sonnet 4.6 (81% vs. 76% on KMA-Bench) while being roughly five times faster and five times cheaper.

This is the model behind the write-up "Not all Context is Knowledge".

Peripheral is a specialist that was trained on about 7,000 task examples and beats a zero-shot generalist, but it transfers only partially to unseen domains and grows less reliable the further it gets from what it saw in training. Read the Limitations before you deploy it.

What it does

Peripheral is the judgement layer on top of a markdown knowledge base, and it performs three read-path tasks, each selected by a tag in the system prompt.

Task	Tag	Question it answers	Output keys
Eval	`[EVAL]`	Is this change correct, incorrect, or mixed?	`verdict` (`accept`/`reject`/`partial`), `reasoning`, `corrected_content`, `confidence`
Routing	`[ROUTE]`	Which files answer this query?	`selected_files`, `reasoning`, `confidence`
Gate	`[GATE]`	Serve, annotate, or block this content?	`gate_decision` (`serve`/`annotate`/`block`), `reasoning`, `risk_level`, `confidence`

The output is always a single JSON object. There is a fourth write-path task, [ORGANIZE], for deciding where new knowledge belongs, but it is only scaffolded in the repository and was not trained into this model, so you should not rely on it.
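The output contract in the table can be checked mechanically before anything downstream trusts a reply. The following is a minimal sketch, not code from the repository: the names `SCHEMAS` and `validate_output` are hypothetical, and only the keys and decision values listed in the table are checked, since the card does not document the allowed values of `risk_level` or the type of `confidence`.

```python
import json

# Output contract per task, transcribed from the table above.
# Value: (required keys, decision key, allowed decision values).
SCHEMAS = {
    "EVAL": (
        {"verdict", "reasoning", "corrected_content", "confidence"},
        "verdict",
        {"accept", "reject", "partial"},
    ),
    # No decision key is documented for routing, so only key presence is checked.
    "ROUTE": ({"selected_files", "reasoning", "confidence"}, None, None),
    "GATE": (
        {"gate_decision", "reasoning", "risk_level", "confidence"},
        "gate_decision",
        {"serve", "annotate", "block"},
    ),
}


def validate_output(task: str, raw: str) -> dict:
    """Parse a model reply and check it against the task's documented keys."""
    required, decision_key, allowed = SCHEMAS[task]
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a single JSON object")
    missing = required - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if decision_key is not None and data[decision_key] not in allowed:
        raise ValueError(f"unexpected {decision_key}: {data[decision_key]!r}")
    return data
```

Rejecting a reply that fails this check, rather than trying to repair it, keeps a malformed answer from being mistaken for a decision.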

Results (KMA-Bench, 226 cases, % correct)

Scores are averaged over multiple runs across three knowledge bases (French wiki, ClickHouse docs, and the PostHog handbook, with 14 cases drawn from real git commits).

Model	Diff Eval	Routing	Gate	Overall	Latency
Heuristic	44%	75%	100%	64%	0ms
Base Qwen3-8B	46%*	13%*	16%*	30%*	~3,500ms
Peripheral (this model)	71%	88%	96%	81%	~760ms
Claude Sonnet 4.6 (zero-shot)	60%	96%	83%	76%	~4,100ms

* The base model cannot reliably emit valid JSON, and fine-tuning is what carried it from 30% to 81%.

A smaller 10-case quick benchmark is in the repo README; its numbers differ slightly because the sample is different.
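As a quick check of how the fine-tune compares with the zero-shot model, the per-task gaps and the latency ratio can be computed directly from the table. This is only arithmetic on the published figures; the `RESULTS` dictionary and the `compare` helper are illustrative names, not part of the repository.

```python
# Figures copied from the results table (percent correct, approximate latency in ms).
RESULTS = {
    "Peripheral": {"diff_eval": 71, "routing": 88, "gate": 96, "overall": 81, "latency_ms": 760},
    "Claude Sonnet 4.6": {"diff_eval": 60, "routing": 96, "gate": 83, "overall": 76, "latency_ms": 4100},
}


def compare(model: str, baseline: str) -> dict:
    """Return percentage-point gaps per task and the latency ratio baseline/model."""
    a, b = RESULTS[model], RESULTS[baseline]
    deltas = {task: a[task] - b[task] for task in ("diff_eval", "routing", "gate", "overall")}
    speedup = round(b["latency_ms"] / a["latency_ms"], 1)
    return {"deltas": deltas, "speedup": speedup}


summary = compare("Peripheral", "Claude Sonnet 4.6")
print(summary)
```

With the table's figures this gives a latency ratio of about 5.4, and it shows that routing is the one task where the zero-shot model stays ahead (96% vs. 88%).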

Data note. The ClickHouse and PostHog knowledge bases were used only for read-only generalization testing, and the released model was trained on French content alone, so it contains none of their text. The ClickHouse documentation is © ClickHouse, used under CC BY-NC-SA 4.0, and the PostHog handbook is © PostHog Inc., used under MIT.

How to use

LM Studio (recommended)

1. In LM Studio, search for malgamves/peripheral-8b and download the Q4_K_M GGUF (it runs in about 4.8 GB of VRAM).
2. Load the model.
3. Give the model the instruction for the task you want, one of [EVAL], [ROUTE], or [GATE] (the full text is under "Prompt format" below), followed by your structured input. The [TAG] is what selects the task, and the instruction can go in LM Studio's System Prompt field or at the top of your message.
4. Send it. You get a single JSON object back.

Keep the temperature low, around 0.1, since the model is trained to emit short, deterministic JSON.
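If you would rather script these steps than use the chat window, LM Studio can expose a local OpenAI-compatible server. The sketch below assumes that server is running at `http://localhost:1234/v1` and that the model is listed under the identifier shown; both are assumptions to adjust for your setup, and `build_payload` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: LM Studio's local server is enabled on this address.
BASE_URL = "http://localhost:1234/v1"
# Assumption: replace with the identifier your server reports for the model.
MODEL_ID = "malgamves/peripheral-8b"


def build_payload(system_prompt: str, user_input: str, temperature: float = 0.1) -> dict:
    """Build a chat-completions request body with the low temperature advised above."""
    return {
        "model": MODEL_ID,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }


def ask(system_prompt: str, user_input: str) -> str:
    """POST the payload and return the reply text, which should be a JSON object."""
    body = json.dumps(build_payload(system_prompt, user_input)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

The tagged instruction goes in the system message here, which mirrors putting it in the System Prompt field.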

llama.cpp

Download the Q4_K_M GGUF and run it with your usual llama.cpp setup, using the prompt format described under "Prompt format" below.

Ollama (optional)

If you prefer Ollama, the repository includes a Modelfile with the [EVAL] prompt baked in as the default system message: build the model with `ollama create`, then start it with `ollama run`. LM Studio is still the recommended path.

Prompt format

Each task is a tagged instruction followed by a structured input. The [TAG] is the operative signal that selects the task. Whether you place the instruction in the system field or inline with your input, keep the format close to the examples below, because the model is sensitive to it.

System prompts

[EVAL] You evaluate changes to a knowledge base. Given a diff and old content, assess whether the change is correct (accept), incorrect (reject), or mixed (partial). Respond in JSON: verdict, reasoning, corrected_content, confidence.

[ROUTE] You select files from a knowledge base to answer a query. Given a query and file list, select 1-3 relevant files. If the query is out of scope, return empty. Respond in JSON: selected_files, reasoning, confidence.

[GATE] You decide whether to serve, annotate, or block knowledge base content based on a change signal. serve=harmless change, annotate=might affect accuracy, block=corrupted or destroyed. Respond in JSON: gate_decision, reasoning, risk_level, confidence.

Example user message (eval)

FILE: grammar/pouvoir.md
QUERY: how do you conjugate pouvoir in the present tense?

CHANGE:
@@ présent @@
-je peux
-nous pouvons
+je pit
+nous piton

OLD CONTENT:
<previous file content, truncated to ~1500 chars>

Evaluate this change: accept, reject, or partial.
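Because the model is sensitive to this layout, it can help to assemble the message in code rather than by hand. The helper below is a sketch that reproduces the example's structure; `build_eval_message` is a hypothetical name, and cutting the old content to its first 1,500 characters is an assumption, since the example only says it is truncated to roughly that length.

```python
# Approximate limit taken from the example above.
MAX_OLD_CONTENT_CHARS = 1500


def build_eval_message(file_path: str, query: str, diff: str, old_content: str) -> str:
    """Assemble an [EVAL] user message in the layout of the example above."""
    # Assumption: keep the head of the file when it exceeds the limit.
    truncated = old_content[:MAX_OLD_CONTENT_CHARS]
    return (
        f"FILE: {file_path}\n"
        f"QUERY: {query}\n"
        "\n"
        "CHANGE:\n"
        f"{diff.strip()}\n"
        "\n"
        "OLD CONTENT:\n"
        f"{truncated}\n"
        "\n"
        "Evaluate this change: accept, reject, or partial."
    )
```

Send the result as the user message, with the [EVAL] instruction as the system prompt or prepended to it.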

Example output

{
  "verdict": "reject",
  "reasoning": "'je pit', 'nous piton' are not valid conjugations of pouvoir; the change looks corrupted.",
  "corrected_content": "no correction needed",
  "confidence": 0.93
}
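A reply like the one above can be consumed with a small, defensive parser. This sketch pulls the first JSON object out of the reply text, in case a runtime wraps it in extra text, and then applies a hypothetical policy of acting only on a confident `accept`. The function names and the 0.8 threshold are illustrative assumptions, not values from the repository.

```python
import json


def extract_json_object(text: str) -> dict:
    """Return the first JSON object found in text, ignoring anything around it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model reply")


def should_apply(reply_text: str, min_confidence: float = 0.8) -> bool:
    """Hypothetical policy: apply a change only on a confident 'accept'."""
    result = extract_json_object(reply_text)
    confident = float(result.get("confidence", 0)) >= min_confidence
    return result.get("verdict") == "accept" and confident
```

For the example output above, `should_apply` returns `False`, because the verdict is `reject`.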

Training


Base	Qwen3-8B
Method	QLoRA via Unsloth Studio
Hardware	single 3090 Ti, ~3 hours
Data	~7,000 examples (eval / routing / gate), generated locally with Qwen3-32B
Epochs	3
Export	Q4_K_M GGUF

The training data was generated locally at zero API cost, and the data-generation and fine-tune-prep scripts are in the repository. Both the Q4_K_M GGUF and the LoRA adapter are published here.

Limitations

It is a specialist, not a generalist. Peripheral was trained on French-grammar knowledge management, and while it transfers to other domains, accuracy falls off with distance: diff evaluation drops from 78% on the French content it was trained on to 70% on ClickHouse docs and 68% on the PostHog handbook.
Routing degrades as the knowledge base grows. It routes correctly about 95% of the time at roughly 20 files but slips to around 80% at 86 files, because more candidate files give the model more room to confuse itself.
It is a quality gate, not a truth gate. It reliably catches corruption, broken grammar, and structural problems, but it does not catch content that is fabricated yet plausible and has no traceable source, which is a problem that needs external verification rather than diff analysis.
It is sensitive to prompt format. Prompts that stray from the formats above degrade the output, so it is worth matching them closely.
The evaluation carries the bias of a single annotator. The test labels and the training labels come from the same person, inter-annotator agreement has not been measured, and the headline numbers are best read as directional rather than definitive.

License

This GGUF is a merged fine-tune of Qwen3-8B and is released under Apache-2.0, Qwen3's own license, with attribution to Qwen. The training and inference code is MIT-licensed.

Citation

@software{peripheral_kma_2026,
  author = {Madalitso Phiri},
  title  = {Peripheral: a fine-tuned small model for filesystem knowledge management},
  year   = {2026},
  url    = {https://github.com/malgamves/peripheral}
}
