magos-k8s-0.6b

A small (0.6B parameter) Kubernetes debugging assistant model. Fine-tuned from Qwen3-0.6B on Kubernetes documentation, the full Kubernetes API reference (every resource Kind), the kubectl command reference, and Prometheus alert runbooks.

Designed to run locally and be embedded in autonomous devops agents — outputs are heavily biased toward concrete, executable commands and YAML manifests an agent can apply directly.

What's new in v8 (vs v7)

| | v7 | v8 |
|---|---|---|
| Stage 2 training examples | ~6,100 | ~6,740 (+10%) |
| YAML bucket | 780 unfiltered | 521 schema-filtered: every example's apiVersion and fields validated against the K8s v1.34 OpenAPI spec; the ~33% of examples with invented fields were dropped before training |
| Anti-hallucination contrast bucket | none | ~317 new examples teaching wrong-vs-right pairs for kubectl flags, YAML field names, and diagnosis patterns mined from v7's actual failures |
| General-instruct mix | none | ~600 Alpaca examples (~9%) blended in to defend against catastrophic forgetting of base reasoning |
| Stage 2 LR / epochs | 1.5e-5 / 2 epochs | 1.5e-5 / 2 epochs (unchanged, proven recipe) |
| Stage 2 eval_loss | 1.667 | 1.716 (slightly higher, as expected, since 9% of examples are out-of-K8s-distribution Alpaca) |

Why these changes

v7's main weaknesses surfaced in agent-usability review:

  1. Specific flag/field hallucinations: --show-namespace, --limit on kubectl logs, volumeAccessModes, autoscaling/v2beta3. We mined the actual hallucinations v7 produced across 75 benchmark verdicts (817 occurrences) and built targeted contrast pairs — for each known wrong pattern, a paired Q&A that explicitly contrasts it with the correct one.
  2. YAML schema invention: v7's YAML bucket was not validated after synthesis. v8 checks each example against the v1.34 OpenAPI spec and drops any example with more than 2 invented field paths.
  3. General-reasoning regression: v7 lost 2 points on the general bucket vs v6 (14 to 12 in the benchmark below). v8 mixes in a small Alpaca slice so non-K8s prompts stay sharp.
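
The schema filter from item 2 can be sketched roughly as follows. This is an illustration, not the actual pipeline: `KNOWN_FIELDS` is a tiny hand-written stand-in for the v1.34 OpenAPI lookup, and `field_paths`, `invented_paths`, and `keep_example` are hypothetical names. Only the "more than 2 invented field paths" threshold comes from the text above.

```python
# Hypothetical stand-in for the OpenAPI lookup: (apiVersion, kind) -> valid field paths.
# A real filter would derive this from the spec and treat map-typed fields
# (labels, annotations, resource requests) specially.
KNOWN_FIELDS = {
    ("v1", "PersistentVolumeClaim"): {
        "apiVersion", "kind", "metadata", "metadata.name",
        "spec", "spec.accessModes", "spec.resources",
        "spec.resources.requests", "spec.resources.requests.storage",
    },
}

MAX_INVENTED = 2  # from the text: drop examples with more than 2 invented paths


def field_paths(obj, prefix=""):
    """Yield a dotted path for every key in a nested manifest."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(value, path)
    elif isinstance(obj, list):
        for item in obj:
            yield from field_paths(item, prefix)


def invented_paths(manifest):
    """Return paths missing from the schema, or None if apiVersion/kind is unknown."""
    known = KNOWN_FIELDS.get((manifest.get("apiVersion"), manifest.get("kind")))
    if known is None:
        return None
    return sorted(p for p in field_paths(manifest) if p not in known)


def keep_example(manifest):
    """Keep an example only if its Kind is known and it is under the threshold."""
    bad = invented_paths(manifest)
    return bad is not None and len(bad) <= MAX_INVENTED


pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {
        "volumeAccessModes": ["ReadWriteOnce"],  # invented; the real field is accessModes
        "resources": {"requests": {"storage": "1Gi"}},
    },
}
print(invented_paths(pvc))  # ['spec.volumeAccessModes']
print(keep_example(pvc))    # True
```

Under this reading of the threshold, an example with one or two invented paths still passes the filter, which is one reason the contrast bucket targets specific wrong fields separately.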

Benchmarks (3-judge consensus, anonymized review of v6 vs v7 vs v8 across 25 prompts)

Each of the 25 prompts was evaluated by 3 independent reviewers who saw the responses anonymized as A/B/C, along with the rubric for that prompt. Reviewers were required to write explicit reasoning, list verified facts and hallucinations, and rate agent_usable before assigning a 1-5 score. The final per-prompt score is the median of the 3 reviewers' scores, so the maximum total is 25 × 5 = 125 points.
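
The median rule can be sketched in a few lines. The function name `consensus_score` and the example scores are made up for illustration; only the "3 reviewers, 1-5 scale, take the median" rule comes from the description above.

```python
from statistics import median


def consensus_score(reviewer_scores):
    """Median of exactly three 1-5 scores, as in the 3-judge setup described above."""
    if len(reviewer_scores) != 3:
        raise ValueError("expected exactly 3 reviewer scores")
    if any(not 1 <= score <= 5 for score in reviewer_scores):
        raise ValueError("scores must be on the 1-5 scale")
    return median(reviewer_scores)


print(consensus_score([2, 5, 3]))  # 3
print(consensus_score([1, 1, 5]))  # 1
```

Taking the median rather than the mean means a single outlier reviewer cannot move a prompt's score.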

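| Bucket | Max | v6 | v7 | v8 (change vs v7) |
|---|---|---|---|---|
| kubectl/CLI accuracy | 30 | 8 | 10 | 14 (+4) |
| YAML manifest validity | 25 | 6 | 11 | 12 (+1) |
| Debugging diagnose | 30 | 9 | 10 | 8 (-2) |
| Prometheus runbook | 25 | 7 | 7 | 6 (-1) |
| General reasoning | 15 | 14 | 12 | 15 (+3) |
| Total | 125 | 44 (35%) | 50 (40%) | 55 (44%) |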
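
As a quick arithmetic check, the totals and percentages can be recomputed from the per-bucket rows. The numbers are copied from the benchmark rows above; the `summarize` helper is just an illustrative name.

```python
# Per-bucket rows from the benchmark: (max, v6, v7, v8)
BUCKETS = {
    "kubectl/CLI accuracy": (30, 8, 10, 14),
    "YAML manifest validity": (25, 6, 11, 12),
    "Debugging diagnose": (30, 9, 10, 8),
    "Prometheus runbook": (25, 7, 7, 6),
    "General reasoning": (15, 14, 12, 15),
}
VERSIONS = ("v6", "v7", "v8")


def summarize(buckets):
    """Return the maximum total and a (points, percent) pair per version."""
    max_total = sum(row[0] for row in buckets.values())
    summary = {}
    for column, version in enumerate(VERSIONS, start=1):
        total = sum(row[column] for row in buckets.values())
        summary[version] = (total, round(100 * total / max_total))
    return max_total, summary


max_total, summary = summarize(BUCKETS)
print(max_total)  # 125
print(summary)    # {'v6': (44, 35), 'v7': (50, 40), 'v8': (55, 44)}
```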

Headline: v8 takes the largest single-version jump yet in kubectl accuracy (+4 points on a 30-point bucket) and recovers full general-reasoning performance, at a small cost in Diagnose and Runbook accuracy (-2 and -1). The Alpaca mix successfully defended against forgetting; the contrast bucket visibly suppressed the specific hallucinated flags v7 was repeating.

Honest absolute level: even v8 scores 44% on this benchmark. The judges grade strictly for agent-usability — a single invented flag or wrong apiVersion is enough to mark a response as not-executable. v8 is the best version of magos yet, but there is substantial room to grow toward 100%.

To pin a specific version when loading:

AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v8")
# or revision="v7" / "v6" / "v5" / "v3" / "v2" for previous versions

What it's good at

  • kubectl command construction: v8's strongest area. Real flags in their correct forms, without the --show-namespace and --limit (on kubectl logs) inventions that v7 produced.
  • YAML manifest generation: Pod, Deployment, Service, NetworkPolicy, PVC, HPA, ConfigMap, Secret, RBAC and ~70 other top-level Kinds, trained on a schema-validated set so that apiVersion and field names are usually correct.
  • Diagnosing pasted errors: kubectl describe output, log lines, alert payloads → root cause + next-step suggestions
  • Prometheus alert handling — meaning + diagnostic steps for the prometheus-operator runbook set (KubePodCrashLooping, etcdBackendQuotaLowSpace, AlertmanagerClusterDown, etc.)
  • Agent-style outputs — short, command-first responses suitable for autonomous execution rather than human reading
  • Basic general reasoning — Alpaca mix preserves math, generic CS facts, short explanations

What it's not good at

  • Multi-step planning or complex tool chains — it's a 0.6B model
  • Subtle/rare flags — common flags are reliable; rare-but-real flags are still sometimes hallucinated. Always sanity-check with kubectl --help.
  • Multi-flag combinations on the same command — accuracy drops as flag count goes up
  • Knowledge of features released after the source docs were captured (mid-2026)
  • Long-form thinking — SFT suppressed Qwen3's <think> behavior
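
The "sanity-check with kubectl --help" advice can be partly automated before an agent executes a generated command. The sketch below is only an illustration: `flags_in_help` and `unknown_flags` are hypothetical helpers, and `SAMPLE_HELP` is a hand-written, abbreviated excerpt in the style of `kubectl logs --help`, not verbatim output. It does not handle combined short flags (such as `-it`), and global flags are listed by `kubectl options` rather than in each subcommand's help.

```python
import re
import shlex

# Matches -x and --long-name when they start a token, not hyphens inside words.
FLAG_RE = re.compile(r"(?<![\w-])--?[A-Za-z][\w-]*")


def flags_in_help(help_text):
    """Collect every flag name that appears in a block of --help output."""
    return set(FLAG_RE.findall(help_text))


def unknown_flags(command, help_text):
    """Return flags used in `command` that the help text never mentions."""
    known = flags_in_help(help_text)
    used = [
        token.split("=", 1)[0]
        for token in shlex.split(command)
        if token.startswith("-") and token.strip("-")
    ]
    return [flag for flag in used if flag not in known]


# Hand-written excerpt for illustration only.
SAMPLE_HELP = """
Options:
    -c, --container='':
    -f, --follow=false:
    --limit-bytes=0:
    -p, --previous=false:
    --since=0s:
    --tail=-1:
"""

print(unknown_flags("kubectl logs web-0 --tail=50 --limit 100", SAMPLE_HELP))
# ['--limit']
```

In practice the help text would come from running the real binary, for example `subprocess.run(["kubectl", "logs", "--help"], capture_output=True, text=True).stdout`, so the check reflects the installed kubectl version.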

How to use

llama.cpp / Ollama / LM Studio

Three GGUF quantization levels are included — pick one:

| File | Size | Quality |
|---|---|---|
| magos-k8s-0.6b-f16.gguf | 1.2 GB | reference (unquantized 16-bit) |
| magos-k8s-0.6b-q8_0.gguf | 610 MB | effectively identical to f16 at half the size; recommended |
| magos-k8s-0.6b-q4_k_m.gguf | 379 MB | smallest, with some quality loss: kubectl flag/argument mistakes appear more often than with q8/f16. Fine for casual use, not recommended for accuracy-critical tasks. |

Example with llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Drain node worker-3 ignoring DaemonSets and deleting local-storage pods."}],
    temperature=0.05,
    repeat_penalty=1.15,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])

The temperature=0.05 and repeat_penalty=1.15 defaults are important — 0.6B models loop on longer structured outputs without a repetition penalty.

Hugging Face transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b",
                                             dtype="bfloat16",
                                             device_map="auto")

messages = [{"role": "user", "content": "Give me a NetworkPolicy that denies all egress from app pods except DNS."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=384,
                     do_sample=True, temperature=0.05,
                     top_p=0.95, top_k=20, repetition_penalty=1.15)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Training

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Method | Two-stage: continued pre-training (CPT) → supervised fine-tuning (SFT). Both full-weight (no LoRA). |
| Stage 1 corpus | 8.5k document chunks: kubernetes.io docs + blog (6.5k), Kubernetes API reference v1.34 (1.9k), Prometheus alert runbooks (106). Reused unchanged from v5/v6/v7. |
| Stage 1 tokens | ~6.5M |
| Stage 1 LR | 5e-6, cosine, 3% warmup, 1 epoch |
| Stage 2 corpus (v8) | 6,740 synthetic Q&A pairs. Distribution: K8s debugging (1.7k), K8s API field/schema (1.3k), Prometheus runbook (1.0k, 10 examples per runbook), kubectl reference (1.3k, 15 per subcommand), schema-filtered YAML bucket (521), anti-hallucination contrast bucket (~317), general-instruct mix (~600) |
| Stage 2 LR | 1.5e-5, cosine, 3% warmup, 2 epochs |
| Micro batch / grad accum | 1 / 16 (effective batch 16) |
| Precision | bfloat16 |
| Sequence length | 2048 |
| Stage 1 eval_loss | 1.71 |
| Stage 2 eval_loss | 1.72 (v7 was 1.67; the small regression reflects the 9% Alpaca slice being out-of-K8s-distribution; the judge benchmark is the real measure) |
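
For a rough sense of scale, the Stage 2 settings above translate into optimizer steps as sketched below. This assumes a single device, one example per sequence (no packing), and that all 6,740 pairs are used for training; none of these are stated in the source, and a held-out eval split would lower the counts. The linear-warmup plus cosine-decay shape in `lr_at` is a common convention, not a confirmed detail of the actual scheduler.

```python
import math


def schedule_summary(n_examples, micro_batch, grad_accum, epochs, warmup_ratio):
    """Derive step counts from the batch and epoch settings."""
    effective_batch = micro_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


stage2 = schedule_summary(n_examples=6740, micro_batch=1, grad_accum=16,
                          epochs=2, warmup_ratio=0.03)
print(stage2)
# {'effective_batch': 16, 'steps_per_epoch': 422, 'total_steps': 844, 'warmup_steps': 26}
```

With under a thousand optimizer steps in total, the 3% warmup covers only a couple of dozen steps, so the peak learning rate is reached almost immediately.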

Files

  • model.safetensors — fine-tuned weights, HF format (1.2 GB, bf16)
  • magos-k8s-0.6b-f16.gguf — GGUF, full precision (1.2 GB)
  • magos-k8s-0.6b-q8_0.gguf — GGUF, 8-bit quantization (610 MB)
  • magos-k8s-0.6b-q4_k_m.gguf — GGUF, 4-bit quantization (379 MB)
  • tokenizer.json, tokenizer_config.json — Qwen3 tokenizer
  • chat_template.jinja — Qwen3 ChatML template
  • config.json, generation_config.json — standard HF configs (with magos sampling defaults)

Limitations and intended use

This is a small experimental model. Always verify any command, YAML, or behavioral claim against current Kubernetes documentation before running in production. It is intended for learning, prototyping, and as a component in local devops agents — not as an authoritative source.
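
One way an agent could apply the verification advice above to generated YAML is a server-side dry run before any real apply. The sketch below is an assumption about how such a guard might look, not part of magos itself: `dry_run_argv` and `validate_manifest` are hypothetical helpers, and they require kubectl on the PATH with access to a cluster.

```python
import subprocess


def dry_run_argv(namespace=None):
    """Build the kubectl command that validates a manifest read from stdin."""
    argv = ["kubectl", "apply", "--dry-run=server", "-f", "-"]
    if namespace:
        argv += ["--namespace", namespace]
    return argv


def validate_manifest(yaml_text, namespace=None, timeout=30):
    """Return (ok, message); nothing is persisted because of --dry-run=server."""
    proc = subprocess.run(
        dry_run_argv(namespace),
        input=yaml_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, (proc.stderr or proc.stdout).strip()


print(dry_run_argv("staging"))
# ['kubectl', 'apply', '--dry-run=server', '-f', '-', '--namespace', 'staging']
```

Calling `validate_manifest(model_output)` and refusing to proceed when `ok` is false lets the API server reject unknown fields or a wrong apiVersion before anything changes in the cluster.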

License

Apache 2.0. Inherits from the Qwen3-0.6B base model license. The training data is derived from the official Kubernetes documentation (CC-BY 4.0) and the prometheus-operator Prometheus runbooks (Apache 2.0).
