magos-k8s-0.6b
A small (0.6B parameter) Kubernetes debugging assistant model. Fine-tuned from Qwen3-0.6B on Kubernetes documentation, the full Kubernetes API reference (every resource Kind), the kubectl command reference, and Prometheus alert runbooks.
Designed to run locally and be embedded in autonomous devops agents — outputs are heavily biased toward concrete, executable commands and YAML manifests an agent can apply directly.
What's new in v8 (vs v7)
| | v7 | v8 |
|---|---|---|
| Stage 2 training examples | ~6,100 | ~6,740 (+10%) |
| YAML bucket | 780 unfiltered | 521 schema-filtered — every example's apiVersion+fields validated against the K8s v1.34 OpenAPI spec; ~33% invented-field examples dropped before training |
| Anti-hallucination contrast bucket | none | ~317 new examples teaching wrong-vs-right pairs for kubectl flags, YAML field names, and diagnosis patterns mined from v7's actual failures |
| General-instruct mix | none | ~600 Alpaca examples (~9%) blended in to defend against catastrophic forgetting of base reasoning |
| Stage 2 LR / epochs | 1.5e-5 / 2 epochs | 1.5e-5 / 2 epochs (unchanged — proven recipe) |
| Stage 2 eval_loss | 1.667 | 1.716 (slightly higher — expected, since 9% of examples are out-of-K8s-distribution Alpaca) |
Why these changes
v7's main weaknesses surfaced in agent-usability review:
- Specific flag/field hallucinations: `--show-namespace`, `--limit` on `kubectl logs`, `volumeAccessModes`, `autoscaling/v2beta3`. We mined the actual hallucinations v7 produced across 75 benchmark verdicts (817 occurrences) and built targeted contrast pairs: for each known wrong pattern, a paired Q&A that explicitly contrasts it with the correct one.
- YAML schema invention: v7's YAML bucket was not validated post-synth. v8 runs each example through the v1.34 OpenAPI lookup and drops any example with >2 invented field paths.
- General-reasoning regression: v7 regressed on the general bucket vs v6 (12 vs 14 out of 15 in the benchmark table below). v8 mixes in a small Alpaca slice so non-K8s prompts stay sharp.
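The schema filter described above can be sketched roughly as follows. This is a minimal illustration, not the actual training pipeline: `KNOWN_PATHS` is a hand-written, hypothetical stand-in for the field paths a real filter would read from the Kubernetes v1.34 OpenAPI document, the function names are made up for this example, and dropping examples with an unknown apiVersion/kind pair is an assumption.

```python
# Hypothetical stand-in for the OpenAPI lookup: allowed dotted field paths
# per (apiVersion, kind). A real filter would derive these from the spec.
KNOWN_PATHS = {
    ("v1", "PersistentVolumeClaim"): {
        "apiVersion", "kind",
        "metadata", "metadata.name", "metadata.namespace",
        "spec", "spec.accessModes", "spec.storageClassName",
        "spec.resources", "spec.resources.requests",
        "spec.resources.requests.storage",  # listed explicitly in this toy table
    },
}

MAX_INVENTED = 2  # examples with more than 2 invented paths are dropped


def field_paths(node, prefix=""):
    """Yield a dotted path for every key in a nested manifest."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            yield from field_paths(item, prefix)


def invented_paths(manifest):
    """Return sorted paths missing from the lookup, or None if the
    apiVersion/kind pair itself is unknown."""
    known = KNOWN_PATHS.get((manifest.get("apiVersion"), manifest.get("kind")))
    if known is None:
        return None
    return sorted(p for p in field_paths(manifest) if p not in known)


def keep_example(manifest):
    """Keep an example only if its Kind is known and it has at most
    MAX_INVENTED invented field paths (assumed policy)."""
    invented = invented_paths(manifest)
    return invented is not None and len(invented) <= MAX_INVENTED


good = {
    "apiVersion": "v1", "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {"accessModes": ["ReadWriteOnce"],
             "resources": {"requests": {"storage": "1Gi"}}},
}
# storageSize and retainPolicy are made-up field names for illustration.
bad = {
    "apiVersion": "v1", "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {"volumeAccessModes": ["ReadWriteOnce"],
             "storageSize": "1Gi", "retainPolicy": "Keep"},
}
wrong_version = {"apiVersion": "autoscaling/v2beta3",
                 "kind": "HorizontalPodAutoscaler"}

print(keep_example(good), keep_example(bad), keep_example(wrong_version))
```

Manifests are passed as already-parsed dictionaries here to keep the sketch free of a YAML dependency; a real pipeline would parse the synthesized YAML first.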
Benchmarks (3-judge consensus, anonymized review of v6 vs v7 vs v8 across 25 prompts)
Each of 25 prompts was evaluated by 3 independent reviewers who saw the
responses anonymized as A/B/C with the rubric for that prompt. Reviewers
were forced to produce explicit reasoning, list verified facts and
hallucinations, and rate agent_usable before assigning a 1-5 score. Final
per-prompt score is the median of the 3 reviewers' scores.
| Bucket | Max | v6 | v7 | v8 |
|---|---|---|---|---|
| kubectl/CLI accuracy | 30 | 8 | 10 | 14 (+4) |
| YAML manifest validity | 25 | 6 | 11 | 12 (+1) |
| Debugging diagnose | 30 | 9 | 10 | 8 (-2) |
| Prometheus runbook | 25 | 7 | 7 | 6 (-1) |
| General reasoning | 15 | 14 | 12 | 15 (+3) |
| Total | 125 | 44 (35%) | 50 (40%) | 55 (44%) |
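As a rough illustration of the aggregation, the sketch below takes the median of three reviewer scores for one prompt and recomputes the totals row from the bucket rows. The bucket numbers are copied from the table above; the function names and the sample reviewer scores are hypothetical.

```python
from statistics import median

# (bucket max, per-version points), copied from the benchmark table.
BUCKETS = {
    "kubectl/CLI accuracy":   (30, {"v6": 8,  "v7": 10, "v8": 14}),
    "YAML manifest validity": (25, {"v6": 6,  "v7": 11, "v8": 12}),
    "Debugging diagnose":     (30, {"v6": 9,  "v7": 10, "v8": 8}),
    "Prometheus runbook":     (25, {"v6": 7,  "v7": 7,  "v8": 6}),
    "General reasoning":      (15, {"v6": 14, "v7": 12, "v8": 15}),
}


def prompt_score(reviewer_scores):
    """Per-prompt score: the median of the three reviewers' 1-5 ratings."""
    return median(reviewer_scores)


def totals(buckets):
    """Sum bucket points per version and express them as a rounded percentage."""
    max_total = sum(max_points for max_points, _ in buckets.values())
    result = {}
    for version in ("v6", "v7", "v8"):
        points = sum(scores[version] for _, scores in buckets.values())
        result[version] = (points, round(100 * points / max_total))
    return max_total, result


print(prompt_score([2, 4, 5]))  # hypothetical reviewer scores for one prompt
print(totals(BUCKETS))
```

With 25 prompts scored 1-5, the maximum is 125, which matches the bucket maxima if the buckets hold 6, 5, 6, 5, and 3 prompts respectively.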
Headline: v8 takes the largest single-version jump yet in kubectl accuracy (+4 points on a 30-point bucket) and recovers full general-reasoning performance, at a small cost in Diagnose and Runbook accuracy (-2 and -1). The Alpaca mix successfully defended against forgetting; the contrast bucket visibly suppressed the specific hallucinated flags v7 was repeating.
Honest absolute level: even v8 scores 44% on this benchmark. The judges grade strictly for agent-usability — a single invented flag or wrong apiVersion is enough to mark a response as not-executable. v8 is the best version of magos yet, but there is substantial room to grow toward 100%.
To pin a specific version when loading:
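from transformers import AutoModelForCausalLM
# revision selects a specific version of the repository
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v8")
# or revision="v7" / "v6" / "v5" / "v3" / "v2" for previous versions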
What it's good at
- kubectl command construction: v8's strongest area. Real flags, correct flag forms, none of the `--show-namespace` / `--limit`-on-logs style inventions seen in v7.
- YAML manifest generation: Pod, Deployment, Service, NetworkPolicy, PVC, HPA, ConfigMap, Secret, RBAC and ~70 other top-level Kinds all have correct apiVersion and field names (schema-validated training set).
- Diagnosing pasted errors: `kubectl describe` output, log lines, alert payloads → root cause + next-step suggestions
- Prometheus alert handling: meaning + diagnostic steps for the prometheus-operator runbook set (KubePodCrashLooping, etcdBackendQuotaLowSpace, AlertmanagerClusterDown, etc.)
- Agent-style outputs: short, command-first responses suitable for autonomous execution rather than human reading
- Basic general reasoning: Alpaca mix preserves math, generic CS facts, short explanations
What it's not good at
- Multi-step planning or complex tool chains: it's a 0.6B model
- Subtle/rare flags: common flags are reliable; rare-but-real flags are still sometimes hallucinated. Always sanity-check with `kubectl --help`.
- Multi-flag combinations on the same command: accuracy drops as flag count goes up
- Knowledge of features released after the source docs were captured (mid-2026)
- Long-form thinking: SFT suppressed Qwen3's `<think>` behavior
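The `kubectl --help` sanity check recommended above could be automated along these lines. This is a sketch under stated assumptions: `unknown_flags` is a hypothetical helper, and `HELP_EXCERPT` is an abbreviated, illustrative excerpt rather than verbatim `kubectl logs --help` output. Combined short flags such as `-it` are not checked.

```python
import re
import shlex

# Matches a long flag (--name) or a single-letter short flag (-f) that is
# not part of a larger word such as "non-zero".
FLAG_RE = re.compile(r"(?<![\w-])(?:--[A-Za-z0-9][A-Za-z0-9-]*|-[A-Za-z])(?![\w-])")

# Illustrative excerpt only; real help text is longer and may be laid out differently.
HELP_EXCERPT = """
    -f, --follow=false:
        Specify if the logs should be streamed.
    --tail=-1:
        Lines of recent log file to display.
    --since=0s:
        Only return logs newer than a relative duration.
"""


def unknown_flags(command, help_text):
    """Return flags used in `command` that never appear in `help_text`."""
    known = set(FLAG_RE.findall(help_text))
    unknown = []
    for token in shlex.split(command):
        name = token.split("=", 1)[0]  # strip "=value" from --flag=value
        if FLAG_RE.fullmatch(name) and name not in known:
            unknown.append(name)
    return unknown


generated = "kubectl logs my-pod --show-namespace --tail=20 -f"
print(unknown_flags(generated, HELP_EXCERPT))
```

In practice the help text would come from running the subcommand's `--help` (for example via `subprocess.run`), concatenated with the output of `kubectl options` so that global flags such as `-n` are not reported as unknown.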
How to use
llama.cpp / Ollama / LM Studio
Three GGUF quantization levels are included — pick one:
| File | Size | Quality |
|---|---|---|
| `magos-k8s-0.6b-f16.gguf` | 1.2 GB | reference (full bf16 precision) |
| `magos-k8s-0.6b-q8_0.gguf` | 610 MB | effectively identical to f16, half the size; recommended |
| `magos-k8s-0.6b-q4_k_m.gguf` | 379 MB | smallest. Some quality loss: kubectl flag/argument mistakes appear more often than with q8/f16. Fine for casual use, not recommended for accuracy-critical tasks. |
Example with llama-cpp-python:
from llama_cpp import Llama
llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Drain node worker-3 ignoring DaemonSets and deleting local-storage pods."}],
    temperature=0.05,
    repeat_penalty=1.15,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])
The temperature=0.05 and repeat_penalty=1.15 defaults are important —
0.6B models loop on longer structured outputs without a repetition penalty.
Hugging Face transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b",
                                             dtype="bfloat16",
                                             device_map="auto")
messages = [{"role": "user", "content": "Give me a NetworkPolicy that denies all egress from app pods except DNS."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=384,
                     do_sample=True, temperature=0.05,
                     top_p=0.95, top_k=20, repetition_penalty=1.15)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Training
| | |
|---|---|
| Base model | Qwen/Qwen3-0.6B |
| Method | Two stage: continued pre-training (CPT) → supervised fine-tuning (SFT). Both full-weight (no LoRA). |
| Stage 1 corpus | |
| Stage 1 tokens | ~6.5M |
| Stage 1 LR | 5e-6, cosine, 3% warmup, 1 epoch |
| Stage 2 corpus (v8) | |
| Stage 2 LR | 1.5e-5, cosine, 3% warmup, 2 epochs |
| Micro batch / grad accum | 1 / 16 (effective batch 16) |
| Precision | bfloat16 |
| Sequence length | 2048 |
| Stage 1 eval_loss | 1.71 |
| Stage 2 eval_loss | 1.72 (v7 was 1.67; the small regression reflects the 9% Alpaca slice being out-of-K8s-distribution — judge benchmark is the real measure) |
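For a sense of scale, the Stage 2 settings in the table translate into an approximate optimizer-step count as sketched below. The function name is hypothetical, and the result assumes a single device, no sequence packing, and that all ~6,740 Stage 2 examples are in the training split; none of these are stated in the card, so treat the numbers as an estimate.

```python
import math


def optimizer_steps(num_examples, micro_batch, grad_accum, epochs, warmup_ratio):
    """Estimate effective batch size, total optimizer steps, and warmup steps."""
    effective_batch = micro_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Stage 2 (v8): ~6,740 examples, micro batch 1, grad accum 16, 2 epochs, 3% warmup.
print(optimizer_steps(6740, micro_batch=1, grad_accum=16, epochs=2, warmup_ratio=0.03))
```

Under those assumptions the run is on the order of 850 optimizer steps, with roughly the first 25 spent in warmup.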
Files
- `model.safetensors`: fine-tuned weights, HF format (1.2 GB, bf16)
- `magos-k8s-0.6b-f16.gguf`: GGUF, full precision (1.2 GB)
- `magos-k8s-0.6b-q8_0.gguf`: GGUF, 8-bit quantization (610 MB)
- `magos-k8s-0.6b-q4_k_m.gguf`: GGUF, 4-bit quantization (379 MB)
- `tokenizer.json`, `tokenizer_config.json`: Qwen3 tokenizer
- `chat_template.jinja`: Qwen3 ChatML template
- `config.json`, `generation_config.json`: standard HF configs (with magos sampling defaults)
Limitations and intended use
This is a small experimental model. Always verify any command, YAML, or behavioral claim against current Kubernetes documentation before running in production. It is intended for learning, prototyping, and as a component in local devops agents — not as an authoritative source.
License
Apache 2.0. Inherits from the Qwen3-0.6B base model license. The training data is derived from the official Kubernetes documentation (CC-BY 4.0) and the prometheus-operator Prometheus runbooks (Apache 2.0).