magos-k8s-0.6b

A small (0.6B parameter) Kubernetes debugging assistant model. Fine-tuned from Qwen3-0.6B on Kubernetes documentation, the full Kubernetes API reference (every resource Kind), the kubectl command reference, and Prometheus alert runbooks.

Designed to run locally and be embedded in autonomous devops agents — outputs are heavily biased toward concrete, executable commands and YAML manifests an agent can apply directly.

What's new in v8 (vs v7)

| | v7 | v8 |
|---|---|---|
| Stage 2 training examples | ~6,100 | ~6,740 (+10%) |
| YAML bucket | 780 unfiltered | 521 schema-filtered: every example's apiVersion and fields validated against the K8s v1.34 OpenAPI spec; the ~33% of examples with invented fields were dropped before training |
| Anti-hallucination contrast bucket | none | ~317 new examples teaching wrong-vs-right pairs for kubectl flags, YAML field names, and diagnosis patterns mined from v7's actual failures |
| General-instruct mix | none | ~600 Alpaca examples (~9%) blended in to defend against catastrophic forgetting of base reasoning |
| Stage 2 LR / epochs | 1.5e-5 / 2 epochs | 1.5e-5 / 2 epochs (unchanged, proven recipe) |
| Stage 2 eval_loss | 1.667 | 1.716 (slightly higher, as expected, since 9% of examples are out-of-K8s-distribution Alpaca) |

Why these changes

v7's main weaknesses surfaced in agent-usability review:

Specific flag/field hallucinations: --show-namespace, --limit on kubectl logs, volumeAccessModes, autoscaling/v2beta3. We mined the actual hallucinations v7 produced across 75 benchmark verdicts (817 occurrences) and built targeted contrast pairs — for each known wrong pattern, a paired Q&A that explicitly contrasts it with the correct one.
YAML schema invention: v7's YAML bucket was not validated post-synth. v8 runs each example through the v1.34 OpenAPI lookup and drops any example with >2 invented field paths.
General-reasoning regression: v7 lost 2 points on the general bucket vs v6 (14 to 12). v8 mixes in a small Alpaca slice so non-K8s prompts stay sharp.
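The YAML filtering rule described above (drop any example with more than 2 invented field paths) can be sketched roughly as follows. This is an illustration, not the project's actual filter: `KNOWN_PATHS` is a tiny hypothetical stand-in for the v1.34 OpenAPI lookup, manifests are assumed to be already parsed into dicts, and all function names are made up for this example.

```python
from typing import Any, Iterator

# Hypothetical stand-in for the OpenAPI lookup: (apiVersion, kind) -> valid field paths.
# A real lookup would also have to handle free-form maps such as metadata.labels.
KNOWN_PATHS = {
    ("v1", "Pod"): {
        "apiVersion", "kind", "metadata", "metadata.name",
        "spec", "spec.containers", "spec.containers.name", "spec.containers.image",
    },
}

MAX_INVENTED = 2  # from the text: examples with more than 2 invented paths are dropped


def field_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted field paths of a parsed manifest; list indices are skipped."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            yield from field_paths(item, prefix)


def invented_paths(manifest: dict) -> set[str]:
    """Return the field paths that the lookup does not know for this apiVersion/kind."""
    known = KNOWN_PATHS.get((manifest.get("apiVersion"), manifest.get("kind")))
    paths = set(field_paths(manifest))
    if known is None:
        return paths  # unknown apiVersion/kind: treat every path as invented
    return paths - known


def keep_example(manifest: dict) -> bool:
    """Apply the threshold: keep the example only if it has at most MAX_INVENTED invented paths."""
    return len(invented_paths(manifest)) <= MAX_INVENTED
```

Under this sketch an unknown apiVersion such as autoscaling/v2beta3 fails the lookup outright, so the example is dropped as soon as it has more than two fields.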

Benchmarks (3-judge consensus, anonymized review of v6 vs v7 vs v8 across 25 prompts)

Each of the 25 prompts was evaluated by 3 independent reviewers, who saw the responses anonymized as A/B/C alongside the rubric for that prompt. Reviewers were required to write out explicit reasoning, list verified facts and hallucinations, and rate agent_usable before assigning a 1-5 score. The final per-prompt score is the median of the 3 reviewers' scores.
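As a minimal sketch of the scoring rule just described, the per-prompt score is the median of three 1-5 scores and a bucket total is the sum of its per-prompt medians. The function names and the example scores below are made up for illustration and are not benchmark data.

```python
from statistics import median


def prompt_score(reviewer_scores):
    """Median of the three reviewers' 1-5 scores for one prompt."""
    if len(reviewer_scores) != 3:
        raise ValueError("expected exactly 3 reviewer scores")
    if not all(1 <= s <= 5 for s in reviewer_scores):
        raise ValueError("scores must be in the range 1-5")
    return int(median(reviewer_scores))


def bucket_total(per_prompt_reviews):
    """Sum of per-prompt medians across all prompts in a bucket."""
    return sum(prompt_score(scores) for scores in per_prompt_reviews)


# Made-up scores for a hypothetical 3-prompt bucket (maximum 3 * 5 = 15).
example_bucket = [(1, 2, 5), (3, 3, 4), (2, 1, 1)]
print(bucket_total(example_bucket))  # medians are 2, 3, 1, so this prints 6
```

Using the median means a single outlier reviewer (the 5 in the first tuple) cannot move a prompt's score on its own.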

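| Bucket | Max | v6 | v7 | v8 (change vs v7) |
|---|---|---|---|---|
| kubectl/CLI accuracy | 30 | 8 | 10 | 14 (+4) |
| YAML manifest validity | 25 | 6 | 11 | 12 (+1) |
| Debugging diagnose | 30 | 9 | 10 | 8 (-2) |
| Prometheus runbook | 25 | 7 | 7 | 6 (-1) |
| General reasoning | 15 | 14 | 12 | 15 (+3) |
| Total | 125 | 44 (35%) | 50 (40%) | 55 (44%) |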

Headline: v8 takes the largest single-version jump yet in kubectl accuracy (+4 points on a 30-point bucket) and recovers full general-reasoning performance, at a small cost in Diagnose and Runbook accuracy (-2 and -1). The Alpaca mix successfully defended against forgetting; the contrast bucket visibly suppressed the specific hallucinated flags v7 was repeating.

Honest absolute level: even v8 scores 44% on this benchmark. The judges grade strictly for agent-usability — a single invented flag or wrong apiVersion is enough to mark a response as not-executable. v8 is the best version of magos yet, but there is substantial room to grow toward 100%.
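The totals, percentages, and v7-to-v8 deltas quoted above follow directly from the per-bucket scores. The short check below recomputes them; the numbers are copied from the benchmark table, while the variable names are only illustrative.

```python
# (max, v6, v7, v8) per bucket, copied from the benchmark table.
BUCKETS = {
    "kubectl/CLI accuracy": (30, 8, 10, 14),
    "YAML manifest validity": (25, 6, 11, 12),
    "Debugging diagnose": (30, 9, 10, 8),
    "Prometheus runbook": (25, 7, 7, 6),
    "General reasoning": (15, 14, 12, 15),
}
VERSIONS = ("v6", "v7", "v8")

# 25 prompts scored 1-5 each give a maximum of 125 points.
max_total = sum(row[0] for row in BUCKETS.values())
totals = {
    version: sum(row[i + 1] for row in BUCKETS.values())
    for i, version in enumerate(VERSIONS)
}
# Per-bucket change from v7 to v8.
deltas = {name: row[3] - row[2] for name, row in BUCKETS.items()}

for version in VERSIONS:
    print(f"{version}: {totals[version]}/{max_total} ({totals[version] / max_total:.0%})")
# v6: 44/125 (35%)
# v7: 50/125 (40%)
# v8: 55/125 (44%)
```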

To pin a specific version when loading:

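```python
from transformers import AutoModelForCausalLM

# revision selects the tagged version of the repository to load
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v8")
# or revision="v7" / "v6" / "v5" / "v3" / "v2" for previous versions
```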

What it's good at

kubectl command construction: v8's strongest area. Real flags in their correct forms, without the invented flags v7 produced (for example --show-namespace, or --limit on kubectl logs).
YAML manifest generation — Pod, Deployment, Service, NetworkPolicy, PVC, HPA, ConfigMap, Secret, RBAC and ~70 other top-level Kinds all have correct apiVersion and field names (schema-validated training set).
Diagnosing pasted errors — kubectl describe output, log lines, alert payloads → root cause + next-step suggestions
Prometheus alert handling — meaning + diagnostic steps for the prometheus-operator runbook set (KubePodCrashLooping, etcdBackendQuotaLowSpace, AlertmanagerClusterDown, etc.)
Agent-style outputs — short, command-first responses suitable for autonomous execution rather than human reading
Basic general reasoning — Alpaca mix preserves math, generic CS facts, short explanations

What it's not good at

Multi-step planning or complex tool chains — it's a 0.6B model
Subtle/rare flags — common flags are reliable; rare-but-real flags are still sometimes hallucinated. Always sanity-check with kubectl --help.
Multi-flag combinations on the same command — accuracy drops as flag count goes up
Knowledge of features released after the source docs were captured (mid-2026)
Long-form thinking — SFT suppressed Qwen3's <think> behavior
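The advice to sanity-check flags with kubectl --help can be partly automated. The sketch below compares the flags in a generated command against help text. It assumes help output lists options in the usual `-f, --follow=false:` style, that global flags are obtained separately from `kubectl options`, and all function names are hypothetical. `SAMPLE_HELP` is an abridged, illustrative excerpt rather than real captured output.

```python
import re
import shlex
import subprocess


def flags_in_help(help_text: str) -> set[str]:
    """Collect every -x / --long-flag token that appears in the help text."""
    return set(re.findall(r"(?<![\w-])--?[A-Za-z][\w-]*", help_text))


def flags_in_command(command: str) -> list[str]:
    """Extract flag names from a command line, dropping any =value suffix."""
    flags = []
    for token in shlex.split(command):
        if token == "--":  # everything after a bare -- is passed through, not a flag
            break
        if re.match(r"--?[A-Za-z]", token):
            flags.append(token.split("=", 1)[0])
    return flags


def unknown_flags(command: str, help_text: str) -> list[str]:
    """Return flags used in the command that the help text never mentions."""
    known = flags_in_help(help_text)
    return [flag for flag in flags_in_command(command) if flag not in known]


def kubectl_help(subcommand: str) -> str:
    """Fetch subcommand help plus global options (requires kubectl on PATH)."""
    text = ""
    for argv in (["kubectl", subcommand, "--help"], ["kubectl", "options"]):
        text += subprocess.run(argv, capture_output=True, text=True, check=True).stdout
    return text


# Abridged, illustrative help excerpt so the example runs without kubectl installed.
SAMPLE_HELP = """
Options:
    -f, --follow=false:
        Specify if the logs should be streamed.
    --since=0s:
        Only return logs newer than a relative duration like 5s, 2m, or 3h.
    --tail=-1:
        Lines of recent log file to display.
"""

print(unknown_flags("kubectl logs web-0 --tail=50 --limit 10", SAMPLE_HELP))  # ['--limit']
```

A check like this only catches flags that do not exist at all; it does not verify that a real flag is used with a valid value or in a valid combination.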

How to use

llama.cpp / Ollama / LM Studio

Three GGUF quantization levels are included — pick one:

| File | Size | Quality |
|---|---|---|
| `magos-k8s-0.6b-f16.gguf` | 1.2 GB | reference (unquantized 16-bit floats) |
| `magos-k8s-0.6b-q8_0.gguf` | 610 MB | effectively identical to f16 at half the size; recommended |
| `magos-k8s-0.6b-q4_k_m.gguf` | 379 MB | smallest, with some quality loss: `kubectl` flag/argument mistakes appear more often than with q8/f16. Fine for casual use, not recommended for accuracy-critical tasks. |

Example with llama-cpp-python:

```python
from llama_cpp import Llama

# Load the local GGUF file with the ChatML chat format
llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Drain node worker-3 ignoring DaemonSets and deleting local-storage pods."}],
    temperature=0.05,
    repeat_penalty=1.15,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
```

The temperature=0.05 and repeat_penalty=1.15 settings matter: without a repetition penalty, 0.6B models tend to loop on longer structured outputs.

Hugging Face transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained(
    "clglavan/magos-k8s-0.6b",
    dtype="bfloat16",  # older transformers releases name this argument torch_dtype
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me a NetworkPolicy that denies all egress from app pods except DNS."}]
# Render the chat template to text, then tokenize it
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Near-greedy sampling with a repetition penalty
out = model.generate(**inputs, max_new_tokens=384,
                     do_sample=True, temperature=0.05,
                     top_p=0.95, top_k=20, repetition_penalty=1.15)
# Decode only the newly generated tokens
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Training


| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Method | Two stage: continued pre-training (CPT) → supervised fine-tuning (SFT). Both full-weight (no LoRA). |
| Stage 1 corpus | ~8.5k document chunks: kubernetes.io docs + blog (~6.5k), Kubernetes API reference v1.34 (~1.9k), Prometheus alert runbooks (~106). Reused from v5/v6/v7; corpus unchanged. |
| Stage 1 tokens | ~6.5M |
| Stage 1 LR | 5e-6, cosine, 3% warmup, 1 epoch |
| Stage 2 corpus (v8) | ~6,740 synthetic Q&A pairs. Distribution: K8s debugging (~1.7k), K8s API field/schema (~1.3k), Prometheus runbook (~1.0k, 10 examples per runbook), kubectl reference (~1.3k, 15 per subcommand), schema-filtered YAML bucket (~520), anti-hallucination contrast bucket (~317), general-instruct mix (~600) |
| Stage 2 LR | 1.5e-5, cosine, 3% warmup, 2 epochs |
| Micro batch / grad accum | 1 / 16 (effective batch 16) |
| Precision | bfloat16 |
| Sequence length | 2048 |
| Stage 1 eval_loss | 1.71 |
| Stage 2 eval_loss | 1.72 (v7 was 1.67; the small regression reflects the 9% Alpaca slice being out-of-K8s-distribution, and the judge benchmark is the real measure) |
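To make the Stage 2 numbers concrete, the sketch below derives an approximate step count from the table and evaluates a cosine schedule with 3% warmup. Several details are assumptions rather than facts from the training run: that all ~6,740 examples are used for training, that the last partial batch is kept, that warmup is linear, and that the cosine decays to zero over the whole run. The function name `lr_at` is made up for this example.

```python
import math

EXAMPLES = 6740            # approximate Stage 2 corpus size from the table
EFFECTIVE_BATCH = 1 * 16   # micro batch 1 x gradient accumulation 16
EPOCHS = 2
PEAK_LR = 1.5e-5
WARMUP_FRAC = 0.03

steps_per_epoch = math.ceil(EXAMPLES / EFFECTIVE_BATCH)  # 422 optimizer steps
total_steps = steps_per_epoch * EPOCHS                   # 844
warmup_steps = max(1, round(total_steps * WARMUP_FRAC))  # 25


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps - 1, total_steps // 2, total_steps - 1):
    print(f"step {step:4d}: lr = {lr_at(step):.3e}")
```

With these assumptions the run is well under a thousand optimizer steps, which is part of why a conservative peak learning rate and short warmup are workable at this scale.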

Files

model.safetensors — fine-tuned weights, HF format (1.2 GB, bf16)
magos-k8s-0.6b-f16.gguf — GGUF, full precision (1.2 GB)
magos-k8s-0.6b-q8_0.gguf — GGUF, 8-bit quantization (610 MB)
magos-k8s-0.6b-q4_k_m.gguf — GGUF, 4-bit quantization (379 MB)
tokenizer.json, tokenizer_config.json — Qwen3 tokenizer
chat_template.jinja — Qwen3 ChatML template
config.json, generation_config.json — standard HF configs (with magos sampling defaults)

Limitations and intended use

This is a small experimental model. Always verify any command, YAML, or behavioral claim against current Kubernetes documentation before running in production. It is intended for learning, prototyping, and as a component in local devops agents — not as an authoritative source.
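One way an agent might act on this advice is to pass generated YAML through a kubectl dry run before applying it. The sketch below is an assumption about how such a gate could look, not part of the model or its tooling. The function names are hypothetical, and `--dry-run=server` needs access to a cluster, while `--dry-run=client` only checks locally.

```python
import subprocess


def dry_run_argv(mode: str = "server") -> list[str]:
    """Build the kubectl argv that validates a manifest read from stdin without persisting it."""
    if mode not in ("client", "server"):
        raise ValueError("mode must be 'client' or 'server'")
    return ["kubectl", "apply", f"--dry-run={mode}", "-f", "-"]


def validate_manifest(yaml_text: str, mode: str = "server") -> tuple[bool, str]:
    """Run the dry run and return (ok, kubectl's message). Requires kubectl on PATH."""
    proc = subprocess.run(
        dry_run_argv(mode),
        input=yaml_text,
        capture_output=True,
        text=True,
    )
    message = (proc.stderr or proc.stdout).strip()
    return proc.returncode == 0, message


print(" ".join(dry_run_argv()))  # kubectl apply --dry-run=server -f -
```

A passing dry run shows only that the API server would accept the object; it says nothing about whether the manifest does what was intended.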

License

Apache 2.0. Inherits from the Qwen3-0.6B base model license. The training data is derived from the official Kubernetes documentation (CC-BY 4.0) and the prometheus-operator Prometheus runbooks (Apache 2.0).
