How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="malgamves/peripheral-8b",
	filename="peripheral-8b-Q4_K_M.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Peripheral (Qwen3-8B)

Peripheral is a fine-tuned 8B knowledge-management agent (KMA) that manages filesystem-based knowledge bases. Given a query, it routes to the right files, decides whether a recent change is safe to serve, and classifies what that change actually is. It runs locally at zero cost per call, and on the knowledge-management tasks it was built for it scores higher overall than a zero-shot frontier model (81% versus 76% for Claude Sonnet 4.6 on KMA-Bench) while being roughly five times faster and five times cheaper.

This is the model behind the write-up "Not all Context is Knowledge".

Peripheral is a specialist that was trained on about 7,000 task examples and beats a zero-shot generalist, but it transfers only partially to unseen domains and grows less reliable the further it gets from what it saw in training. Read the Limitations before you deploy it.

What it does

Peripheral is the judgement layer on top of a markdown knowledge base, and it performs three read-path tasks, each selected by a tag in the system prompt.

| Task | Tag | Question it answers | Output keys |
|---|---|---|---|
| Eval | [EVAL] | Is this change correct, incorrect, or mixed? | verdict (accept/reject/partial), reasoning, corrected_content, confidence |
| Routing | [ROUTE] | Which files answer this query? | selected_files, reasoning, confidence |
| Gate | [GATE] | Serve, annotate, or block this content? | gate_decision (serve/annotate/block), reasoning, risk_level, confidence |

The output is always a single JSON object. There is a fourth write-path task, [ORGANIZE], for deciding where new knowledge belongs, but it is only scaffolded in the repository and was not trained into this model, so you should not rely on it.
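Because every reply is meant to be one JSON object with a fixed set of keys per task, a caller can check the reply before acting on it. The following is a minimal sketch, not part of the repository: the names `EXPECTED_KEYS`, `ALLOWED_VALUES`, and `parse_reply` are hypothetical, and the key sets are taken from the task table above.

```python
import json

# Keys each task is documented to return (from the task table above).
EXPECTED_KEYS = {
    "EVAL": {"verdict", "reasoning", "corrected_content", "confidence"},
    "ROUTE": {"selected_files", "reasoning", "confidence"},
    "GATE": {"gate_decision", "reasoning", "risk_level", "confidence"},
}

# Allowed values for the categorical fields listed in the table.
ALLOWED_VALUES = {
    "EVAL": ("verdict", ("accept", "reject", "partial")),
    "GATE": ("gate_decision", ("serve", "annotate", "block")),
}


def parse_reply(task: str, raw: str) -> dict:
    """Parse a model reply and check it against the documented keys.

    Raises ValueError if the reply is not a JSON object, is missing keys,
    or carries an unexpected categorical value.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = EXPECTED_KEYS[task] - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if task in ALLOWED_VALUES:
        field, allowed = ALLOWED_VALUES[task]
        if data[field] not in allowed:
            raise ValueError(f"unexpected {field}: {data[field]!r}")
    return data
```

A failed check is a reasonable signal to retry or to fall back to a default such as blocking the content, though which fallback fits is a deployment decision.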

Results (KMA-Bench, 226 cases, % correct)

Averaged over multiple runs across three knowledge bases (French wiki, ClickHouse docs, and the PostHog handbook, with 14 cases drawn from real git commits).

| Model | Diff Eval | Routing | Gate | Overall | Latency |
|---|---|---|---|---|---|
| Heuristic | 44% | 75% | 100% | 64% | 0ms |
| Base Qwen3-8B | 46%* | 13%* | 16%* | 30%* | ~3,500ms |
| Peripheral (this model) | 71% | 88% | 96% | 81% | ~760ms |
| Claude Sonnet 4.6 (zero-shot) | 60% | 96% | 83% | 76% | ~4,100ms |

* The base model cannot reliably emit valid JSON, and fine-tuning is what carried it from 30% to 81%.

A smaller 10-case quick benchmark is in the repo README; its numbers differ slightly because the sample is different.

Data note. The ClickHouse and PostHog knowledge bases were used only for read-only generalization testing, and the released model was trained on French content alone, so it contains none of their text. The ClickHouse documentation is © ClickHouse, used under CC BY-NC-SA 4.0, and the PostHog handbook is © PostHog Inc., used under MIT.

How to use

LM Studio (recommended)

  1. In LM Studio, search for malgamves/peripheral-8b and download the Q4_K_M GGUF (it runs in about 4.8 GB of VRAM).
  2. Load the model.
  3. Give the model the task instruction for what you want, one of [EVAL], [ROUTE], or [GATE] (the full text is under "Prompt format" below), followed by your structured input. The [TAG] is what selects the task, and you can put the instruction in LM Studio's System Prompt field or at the top of your message.
  4. Send it. You get a single JSON object back.

Keep the temperature low, around 0.1, since the model is trained to emit short, deterministic JSON.
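If you would rather script LM Studio than use its chat window, the same request can go through its local OpenAI-compatible server. This is a sketch under assumptions: it assumes the server is enabled on its default address (`http://localhost:1234`), and the model identifier `peripheral-8b` is a placeholder for whatever name LM Studio shows for the loaded model. `build_payload` and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: LM Studio's local server is running on its default port.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"


def build_payload(instruction: str, user_input: str, model: str = "peripheral-8b") -> dict:
    """Build a chat-completion request body with the low temperature advised above."""
    return {
        "model": model,  # placeholder; use the identifier LM Studio displays
        "temperature": 0.1,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": user_input},
        ],
    }


def send(payload: dict, url: str = LMSTUDIO_URL) -> str:
    """POST the payload and return the assistant's raw text reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_payload(instruction, user_input))` returns the reply as a string, which should be the single JSON object described earlier.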

llama.cpp

Download the Q4_K_M GGUF and run it with your usual llama.cpp setup, using the prompt format described under "Prompt format" below.

Ollama (optional)

A Modelfile is in the repository if you prefer Ollama. Build the model from it with ollama create (the [EVAL] prompt is baked in as the default system message), then start it with ollama run. LM Studio is still the recommended path.

Prompt format

Each task is a tagged instruction followed by a structured input. The [TAG] is the operative signal that selects the task, and whether you place the instruction in the system field or inline with your input, you should keep the format close to the examples below, because the model is sensitive to it.

System prompts

[EVAL] You evaluate changes to a knowledge base. Given a diff and old content, assess whether the change is correct (accept), incorrect (reject), or mixed (partial). Respond in JSON: verdict, reasoning, corrected_content, confidence.

[ROUTE] You select files from a knowledge base to answer a query. Given a query and file list, select 1-3 relevant files. If the query is out of scope, return empty. Respond in JSON: selected_files, reasoning, confidence.

[GATE] You decide whether to serve, annotate, or block knowledge base content based on a change signal. serve=harmless change, annotate=might affect accuracy, block=corrupted or destroyed. Respond in JSON: gate_decision, reasoning, risk_level, confidence.
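The two placements mentioned above (system field or inline) can be sketched as a small message builder. The instruction strings are copied from the system prompts above; the helper name `build_messages` and the blank line used to join instruction and input in the inline case are assumptions rather than anything the repository specifies.

```python
# Instruction text copied verbatim from the system prompts above.
SYSTEM_PROMPTS = {
    "EVAL": (
        "[EVAL] You evaluate changes to a knowledge base. Given a diff and old "
        "content, assess whether the change is correct (accept), incorrect "
        "(reject), or mixed (partial). Respond in JSON: verdict, reasoning, "
        "corrected_content, confidence."
    ),
    "ROUTE": (
        "[ROUTE] You select files from a knowledge base to answer a query. Given "
        "a query and file list, select 1-3 relevant files. If the query is out of "
        "scope, return empty. Respond in JSON: selected_files, reasoning, confidence."
    ),
    "GATE": (
        "[GATE] You decide whether to serve, annotate, or block knowledge base "
        "content based on a change signal. serve=harmless change, annotate=might "
        "affect accuracy, block=corrupted or destroyed. Respond in JSON: "
        "gate_decision, reasoning, risk_level, confidence."
    ),
}


def build_messages(task: str, user_input: str, inline: bool = False) -> list:
    """Return chat messages with the tagged instruction in the system field or inline."""
    instruction = SYSTEM_PROMPTS[task]
    if inline:
        # Assumption: a blank line separates the instruction from the input.
        return [{"role": "user", "content": f"{instruction}\n\n{user_input}"}]
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": user_input},
    ]
```

With llama-cpp-python, the result can be passed as `llm.create_chat_completion(messages=build_messages("GATE", user_input), temperature=0.1)`.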

Example user message (eval)

FILE: grammar/pouvoir.md
QUERY: how do you conjugate pouvoir in the present tense?

CHANGE:
@@ présent @@
-je peux
-nous pouvons
+je pit
+nous piton

OLD CONTENT:
<previous file content, truncated to ~1500 chars>

Evaluate this change: accept, reject, or partial.
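Since the model is sensitive to prompt format, it can help to build the eval message programmatically so the layout stays fixed. The sketch below mirrors the example message above; `build_eval_input` is a hypothetical helper, and treating "truncated to ~1500 chars" as a plain character slice is an assumption.

```python
def build_eval_input(
    file_path: str,
    query: str,
    diff: str,
    old_content: str,
    max_old_chars: int = 1500,
) -> str:
    """Format an [EVAL] user message in the layout of the example above."""
    # Assumption: truncation is a simple cut at max_old_chars characters.
    truncated = old_content[:max_old_chars]
    return (
        f"FILE: {file_path}\n"
        f"QUERY: {query}\n"
        "\n"
        "CHANGE:\n"
        f"{diff.strip()}\n"
        "\n"
        "OLD CONTENT:\n"
        f"{truncated}\n"
        "\n"
        "Evaluate this change: accept, reject, or partial."
    )


message = build_eval_input(
    file_path="grammar/pouvoir.md",
    query="how do you conjugate pouvoir in the present tense?",
    diff="@@ présent @@\n-je peux\n-nous pouvons\n+je pit\n+nous piton",
    old_content="<previous file content>",
)
print(message)
```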

Example output

{
  "verdict": "reject",
  "reasoning": "'je pit', 'nous piton' are not valid conjugations of pouvoir; the change looks corrupted.",
  "corrected_content": "no correction needed",
  "confidence": 0.93
}

Training

| Setting | Value |
|---|---|
| Base | Qwen3-8B |
| Method | QLoRA via Unsloth Studio |
| Hardware | single 3090 Ti, ~3 hours |
| Data | ~7,000 examples (eval / routing / gate), generated locally with Qwen3-32B |
| Epochs | 3 |
| Export | Q4_K_M GGUF |

The training data was generated locally at zero API cost, and the data-generation and fine-tune-prep scripts are in the repository. Both the Q4_K_M GGUF and the LoRA adapter are published here.

Limitations

  • It is a specialist, not a generalist. Peripheral was trained on French-grammar knowledge management, and while it transfers to other domains, accuracy falls off with distance: diff evaluation drops from 78% on the French content it was trained on to 70% on ClickHouse docs and 68% on the PostHog handbook.
  • Routing degrades as the knowledge base grows. It routes correctly about 95% of the time at roughly 20 files but slips to around 80% at 86 files, because more candidate files give the model more room to confuse itself.
  • It is a quality gate, not a truth gate. It reliably catches corruption, broken grammar, and structural problems, but it does not catch content that is fabricated yet plausible and has no traceable source, which is a problem that needs external verification rather than diff analysis.
  • It is sensitive to prompt format. Prompts that stray from the formats above degrade the output, so it is worth matching them closely.
  • The evaluation carries the bias of a single annotator. The test labels and the training labels come from the same person, inter-annotator agreement has not been measured, and the headline numbers are best read as directional rather than definitive.
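Given that routing accuracy slips on larger knowledge bases, one cheap safeguard is to check a [ROUTE] reply against the file list that was actually sent. This is an illustrative sketch only, not something the repository provides: `check_routing` is a hypothetical helper, and the limit of three files comes from the [ROUTE] instruction ("select 1-3 relevant files").

```python
def check_routing(selected_files: list, candidate_files: list, max_files: int = 3) -> dict:
    """Split a routing reply into files that exist in the candidate list and files that do not."""
    known = set(candidate_files)
    valid = []
    unknown = []
    for name in selected_files:
        if name not in known:
            unknown.append(name)  # the model named a file that was never offered
        elif name not in valid:
            valid.append(name)  # keep order, drop duplicates
    return {
        "valid": valid[:max_files],
        "unknown": unknown,
        # "ok" means no unknown names and no more files than the prompt allows.
        "ok": not unknown and len(valid) <= max_files,
    }


result = check_routing(
    selected_files=["grammar/pouvoir.md", "grammar/missing.md"],
    candidate_files=["grammar/pouvoir.md", "grammar/avoir.md"],
)
print(result)
```

An empty `selected_files` list passes the check, which matches the documented behaviour for out-of-scope queries. The check only catches structural slips; it cannot tell whether a valid file is the right one.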

License

This GGUF is a merged fine-tune of Qwen3-8B and is released under Apache-2.0, Qwen3's own license, with attribution to Qwen. The training and inference code is MIT.

Citation

@software{peripheral_kma_2026,
  author = {Madalitso Phiri},
  title  = {Peripheral: a fine-tuned small model for filesystem knowledge management},
  year   = {2026},
  url    = {https://github.com/malgamves/peripheral}
}