Peripheral (Qwen3-8B)
Peripheral is a fine-tuned 8B knowledge-management agent (KMA) that manages filesystem-based knowledge bases. Given a query it routes to the right files, decides whether a recent change is safe to serve, and classifies what that change actually is. It runs locally at zero cost per call, and on the knowledge-management tasks it was built for it beats frontier models while being roughly five times faster and five times cheaper.
This is the model behind the write-up "Not all Context is Knowledge".
Peripheral is a specialist that was trained on about 7,000 task examples and beats a zero-shot generalist, but it transfers only partially to unseen domains and grows less reliable the further it gets from what it saw in training. Read the Limitations before you deploy it.
What it does
Peripheral is the judgement layer on top of a markdown knowledge base, and it performs three read-path tasks, each selected by a tag in the system prompt.
| Task | Tag | Question it answers | Output keys |
|---|---|---|---|
| Eval | [EVAL] | Is this change correct, incorrect, or mixed? | verdict (accept/reject/partial), reasoning, corrected_content, confidence |
| Routing | [ROUTE] | Which files answer this query? | selected_files, reasoning, confidence |
| Gate | [GATE] | Serve, annotate, or block this content? | gate_decision (serve/annotate/block), reasoning, risk_level, confidence |
The output is always a single JSON object. There is a fourth write-path task, [ORGANIZE], for deciding where new knowledge belongs, but it is only scaffolded in the repository and was not trained into this model, so you should not rely on it.
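Because every reply is a single JSON object with a fixed set of keys per task, it is cheap to check a reply before acting on it. The following is a minimal sketch, not part of the repository: the helper name `parse_response` is hypothetical, and the key sets and allowed values are copied from the table above.

```python
import json

# Output keys per task, copied from the task table.
EXPECTED_KEYS = {
    "EVAL": {"verdict", "reasoning", "corrected_content", "confidence"},
    "ROUTE": {"selected_files", "reasoning", "confidence"},
    "GATE": {"gate_decision", "reasoning", "risk_level", "confidence"},
}

# Allowed values for the two decision fields, also from the table.
ALLOWED_VALUES = {
    "verdict": {"accept", "reject", "partial"},
    "gate_decision": {"serve", "annotate", "block"},
}


def parse_response(task: str, raw: str) -> dict:
    """Parse a model reply and check it against the keys expected for `task`.

    Hypothetical helper: raises ValueError on any mismatch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")

    missing = EXPECTED_KEYS[task] - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    for key, allowed in ALLOWED_VALUES.items():
        if key in data:
            value = data[key]
            if not isinstance(value, str) or value not in allowed:
                raise ValueError(f"unexpected {key}: {value!r}")
    return data
```

Rejecting a malformed reply outright, rather than guessing at its meaning, keeps a bad generation from being treated as a serve or accept decision.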
Results (KMA-Bench, 226 cases, % correct)
Averaged over multiple runs across three knowledge bases (French wiki, ClickHouse docs, and the PostHog handbook, with 14 cases drawn from real git commits).
| Model | Diff Eval | Routing | Gate | Overall | Latency |
|---|---|---|---|---|---|
| Heuristic | 44% | 75% | 100% | 64% | 0ms |
| Base Qwen3-8B | 46%* | 13%* | 16%* | 30%* | ~3,500ms |
| Peripheral (this model) | 71% | 88% | 96% | 81% | ~760ms |
| Claude Sonnet 4.6 (zero-shot) | 60% | 96% | 83% | 76% | ~4,100ms |
* The base model cannot reliably emit valid JSON, and fine-tuning is what carried it from 30% to 81%.
A smaller 10-case quick benchmark is in the repo README; its numbers differ slightly because the sample is different.
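The latency column can be turned into ratios with a few lines of arithmetic. This sketch only restates the table: the latencies are the approximate figures given there, so the ratios are approximate too, and the function name `latency_ratio` is illustrative.

```python
# Approximate latencies in milliseconds, copied from the results table.
LATENCY_MS = {
    "Base Qwen3-8B": 3500,
    "Peripheral": 760,
    "Claude Sonnet 4.6 (zero-shot)": 4100,
}


def latency_ratio(model: str, baseline: str = "Peripheral") -> float:
    """Return how many times longer `model` takes than `baseline`."""
    return LATENCY_MS[model] / LATENCY_MS[baseline]


for name in ("Base Qwen3-8B", "Claude Sonnet 4.6 (zero-shot)"):
    print(f"{name}: {latency_ratio(name):.1f}x the latency of Peripheral")
```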
Data note. The ClickHouse and PostHog knowledge bases were used only for read-only generalization testing, and the released model was trained on French content alone, so it contains none of their text. The ClickHouse documentation is © ClickHouse, used under CC BY-NC-SA 4.0, and the PostHog handbook is © PostHog Inc., used under MIT.
How to use
LM Studio (recommended)
- In LM Studio, search for malgamves/peripheral-8b and download the Q4_K_M GGUF (it runs in about 4.8GB of VRAM).
- Load the model.
- Give the model the task instruction for what you want, one of [EVAL], [ROUTE], or [GATE] (the full text is under "Prompt format" below), followed by your structured input. The [TAG] is what selects the task, and you can put the instruction in LM Studio's System Prompt field or at the top of your message.
- Send it. You get a single JSON object back.
Keep the temperature low, around 0.1, since the model is trained to emit short, deterministic JSON.
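If you would rather script the model than use the chat window, the same steps can be sketched against a local OpenAI-compatible endpoint. This is an illustration under stated assumptions, not the repository's client: `BASE_URL` assumes LM Studio's local server on its default port, `MODEL_ID` is a guess at the identifier your server reports, and `build_payload` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a local OpenAI-compatible server (for example LM Studio's)
# is listening here. Change the URL to match your setup.
BASE_URL = "http://localhost:1234/v1"
# Assumption: use whatever model identifier your server actually lists.
MODEL_ID = "malgamves/peripheral-8b"


def build_payload(instruction: str, structured_input: str,
                  temperature: float = 0.1) -> dict:
    """Build a chat-completions request body with a low temperature."""
    return {
        "model": MODEL_ID,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": structured_input},
        ],
    }


def ask(instruction: str, structured_input: str) -> dict:
    """Send one request and decode the JSON object in the reply."""
    payload = build_payload(instruction, structured_input)
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # The model's reply text is itself expected to be a single JSON object.
    return json.loads(body["choices"][0]["message"]["content"])
```

Only `ask` touches the network, so `build_payload` can be inspected or unit-tested without a running server.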
llama.cpp
Download the Q4_K_M GGUF and run it with your usual llama.cpp setup, using the prompt format described under "Prompt format" below.
Ollama (optional)
A Modelfile is in the repository if you prefer Ollama: build the model with ollama create, which bakes in the [EVAL] prompt as the default system message, and then start it with ollama run. LM Studio is still the recommended path.
Prompt format
Each task is a tagged instruction followed by a structured input. The [TAG] is the operative signal that selects the task, and whether you place the instruction in the system field or inline with your input, you should keep the format close to the examples below, because the model is sensitive to it.
System prompts
[EVAL] You evaluate changes to a knowledge base. Given a diff and old content, assess whether the change is correct (accept), incorrect (reject), or mixed (partial). Respond in JSON: verdict, reasoning, corrected_content, confidence.
[ROUTE] You select files from a knowledge base to answer a query. Given a query and file list, select 1-3 relevant files. If the query is out of scope, return empty. Respond in JSON: selected_files, reasoning, confidence.
[GATE] You decide whether to serve, annotate, or block knowledge base content based on a change signal. serve=harmless change, annotate=might affect accuracy, block=corrupted or destroyed. Respond in JSON: gate_decision, reasoning, risk_level, confidence.
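One way to keep the format stable is to store the three instructions verbatim and build the message list from them. The sketch below is illustrative rather than taken from the repository: `SYSTEM_PROMPTS` and `build_messages` are hypothetical names, and the `inline` flag covers the two placements described above (system field, or the top of the user message).

```python
# The three task instructions, copied verbatim from the prompts above.
SYSTEM_PROMPTS = {
    "EVAL": (
        "[EVAL] You evaluate changes to a knowledge base. Given a diff and "
        "old content, assess whether the change is correct (accept), "
        "incorrect (reject), or mixed (partial). Respond in JSON: verdict, "
        "reasoning, corrected_content, confidence."
    ),
    "ROUTE": (
        "[ROUTE] You select files from a knowledge base to answer a query. "
        "Given a query and file list, select 1-3 relevant files. If the "
        "query is out of scope, return empty. Respond in JSON: "
        "selected_files, reasoning, confidence."
    ),
    "GATE": (
        "[GATE] You decide whether to serve, annotate, or block knowledge "
        "base content based on a change signal. serve=harmless change, "
        "annotate=might affect accuracy, block=corrupted or destroyed. "
        "Respond in JSON: gate_decision, reasoning, risk_level, confidence."
    ),
}


def build_messages(task: str, structured_input: str,
                   inline: bool = False) -> list[dict]:
    """Return chat messages for `task`, with the tagged instruction either
    in the system role or at the top of the user message."""
    prompt = SYSTEM_PROMPTS[task.upper()]
    if inline:
        return [{"role": "user", "content": f"{prompt}\n\n{structured_input}"}]
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": structured_input},
    ]
```

An unknown task name raises `KeyError`, which is deliberate here: the untrained [ORGANIZE] tag, for instance, has no entry.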
Example user message (eval)
FILE: grammar/pouvoir.md
QUERY: how do you conjugate pouvoir in the present tense?
CHANGE:
@@ présent @@
-je peux
-nous pouvons
+je pit
+nous piton
OLD CONTENT:
<previous file content, truncated to ~1500 chars>
Evaluate this change: accept, reject, or partial.
Example output
{
"verdict": "reject",
"reasoning": "'je pit', 'nous piton' are not valid conjugations of pouvoir; the change looks corrupted.",
"corrected_content": "no correction needed",
"confidence": 0.93
}
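The eval user message shown above can be assembled programmatically so that the field order and labels never drift. This is a sketch under assumptions: `format_eval_message` is a hypothetical helper, the 1500-character cut-off mirrors the "~1500 chars" note in the example, and the exact whitespace between sections is inferred from the example rather than confirmed by the source. Only the eval layout is shown because it is the only input format given here.

```python
def format_eval_message(path: str, query: str, diff: str,
                        old_content: str, max_old_chars: int = 1500) -> str:
    """Build an [EVAL] user message in the layout of the example above."""
    # Truncate the previous file content, as the example indicates.
    truncated = old_content[:max_old_chars]
    return "\n".join([
        f"FILE: {path}",
        f"QUERY: {query}",
        "CHANGE:",
        diff.rstrip("\n"),
        "OLD CONTENT:",
        truncated,
        "Evaluate this change: accept, reject, or partial.",
    ])


message = format_eval_message(
    path="grammar/pouvoir.md",
    query="how do you conjugate pouvoir in the present tense?",
    diff="@@ présent @@\n-je peux\n-nous pouvons\n+je pit\n+nous piton\n",
    old_content="<previous file content>",
)
print(message)
```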
Training
| Setting | Value |
|---|---|
| Base | Qwen3-8B |
| Method | QLoRA via Unsloth Studio |
| Hardware | single 3090 Ti, ~3 hours |
| Data | ~7,000 examples (eval / routing / gate), generated locally with Qwen3-32B |
| Epochs | 3 |
| Export | Q4_K_M GGUF |
The training data was generated locally at zero API cost, and the data-generation and fine-tune-prep scripts are in the repository. Both the Q4_K_M GGUF and the LoRA adapter are published here.
Limitations
- It is a specialist, not a generalist. Peripheral was trained on French-grammar knowledge management, and while it transfers to other domains, accuracy falls off with distance: diff evaluation drops from 78% on the French content it was trained on to 70% on ClickHouse docs and 68% on the PostHog handbook.
- Routing degrades as the knowledge base grows. It routes correctly about 95% of the time at roughly 20 files but slips to around 80% at 86 files, because more candidate files give the model more room to confuse itself.
- It is a quality gate, not a truth gate. It reliably catches corruption, broken grammar, and structural problems, but it does not catch content that is fabricated yet plausible and has no traceable source, which is a problem that needs external verification rather than diff analysis.
- It is sensitive to prompt format. Prompts that stray from the formats above degrade the output, so it is worth matching them closely.
- The evaluation carries the bias of a single annotator. The test labels and the training labels come from the same person, inter-annotator agreement has not been measured, and the headline numbers are best read as directional rather than definitive.
License
This GGUF is a merged fine-tune of Qwen3-8B and is released under Apache-2.0, Qwen3's own license, with attribution to Qwen. The training and inference code is MIT.
Citation
@software{peripheral_kma_2026,
author = {Madalitso Phiri},
title = {Peripheral: a fine-tuned small model for filesystem knowledge management},
year = {2026},
url = {https://github.com/malgamves/peripheral}
}