ProofKit Qwen 0.5B: distilled (GGUF)

The llama.cpp / GGUF build of visproj/proofkit-distilled-qwen0.5b, a Qwen 0.5B student distilled from ProofKit's fine-tuned gpt-oss-20b teacher. This is the default model the ProofKit Space serves: it runs free on CPU via llama.cpp, so the app works on a free Space with no GPU.

  • Quantization: q4_k_m (~400 MB)
  • Runtime: llama-cpp-python / llama.cpp
  • Chat template: Qwen2 (embedded in the GGUF metadata)

Usage

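from llama_cpp import Llama

# SYSTEM and PROMPT must follow the prompt shapes in ProofKit's prompt_formats.py;
# the strings below are placeholders, not the real format.
SYSTEM = "<system prompt from prompt_formats.py>"
PROMPT = "<user prompt from prompt_formats.py>"

# Downloads the GGUF from the Hub on first use (requires huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="visproj/proofkit-distilled-qwen0.5b-gguf",
    filename="*q4_k_m.gguf",
    n_ctx=4096,
)
resp = llm.create_chat_completion(
    messages=[{"role": "system", "content": SYSTEM}, {"role": "user", "content": PROMPT}],
    temperature=0.0,
)
print(resp["choices"][0]["message"]["content"])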

Configure it in ProofKit with:

export PROOFKIT_DISTILLED_MODELS='ProofKit Qwen 0.5B Distilled=visproj/proofkit-distilled-qwen0.5b-gguf|*q4_k_m.gguf'
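
How ProofKit reads this variable is not shown here. As a rough sketch, assuming each entry has the shape `Label=repo_id|filename_glob` seen in the export above, one entry could be split as follows. `ModelSpec` and `parse_model_entry` are hypothetical names for illustration, not part of ProofKit.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    label: str      # display name shown in the app
    repo_id: str    # Hugging Face repo holding the GGUF file
    filename: str   # file name or glob pattern inside the repo


def parse_model_entry(entry: str) -> ModelSpec:
    """Split one 'Label=repo_id|filename' entry into its three parts.

    Hypothetical helper: the real ProofKit parser may differ.
    """
    label, sep, rest = entry.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in entry: {entry!r}")
    repo_id, sep, filename = rest.partition("|")
    if not sep:
        raise ValueError(f"missing '|' in entry: {entry!r}")
    return ModelSpec(label.strip(), repo_id.strip(), filename.strip())


spec = parse_model_entry(
    "ProofKit Qwen 0.5B Distilled=visproj/proofkit-distilled-qwen0.5b-gguf|*q4_k_m.gguf"
)
print(spec.label)
print(spec.repo_id)
print(spec.filename)
```

The `repo_id` and `filename` fields line up with the arguments passed to `Llama.from_pretrained` in the usage snippet.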

Evaluation (post-fix, 3-judge panel)

Mean score (0–100) on 15 held-out prompts, graded by Claude Opus 4.7, GPT-5.5, and a local Qwen-3B (gpt-oss experts is a deliberately un-retrained stale control):

model                              Claude  GPT-5.5  Qwen-3B   Avg
gpt-5.5 (frontier ceiling)           94.6     95.6     90.8  93.7
gpt-oss attn (retrained teacher)     82.0     66.8     81.4  76.7
qwen-0.5b distilled (served)         79.0     68.6     82.2  76.6
qwen-0.5b direct 7k (served)         78.6     64.4     82.0  75.0
gpt-oss experts (stale control)      67.6     68.6     81.8  72.7
qwen-3b base                         62.1     67.1     80.5  69.9
gpt-oss base                         55.4     53.8     68.2  59.1
qwen-0.5b base                       36.5     44.5     67.9  49.7

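On the three-judge average, both served retrained 0.5Bs beat the stale control and every untuned base, and the distilled 0.5B roughly ties its own 20B teacher (76.6 vs 76.7). The GPT-5.5 judge is the exception: it scores the distilled 0.5B level with the stale control (68.6 each) and puts the direct 7k model (64.4) below both the stale control and qwen-3b base (67.1).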
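
For readers who want to check the Avg column, the sketch below recomputes it as a plain mean of the three per-judge scores copied from the table. This assumes Avg is an unweighted mean, which the table does not state explicitly. Because the inputs are already rounded to one decimal, a recomputed value can differ from the published one in the last digit.

```python
# Per-judge scores copied from the table: (Claude, GPT-5.5, Qwen-3B).
SCORES = {
    "gpt-5.5 (frontier ceiling)": (94.6, 95.6, 90.8),
    "gpt-oss attn (retrained teacher)": (82.0, 66.8, 81.4),
    "qwen-0.5b distilled (served)": (79.0, 68.6, 82.2),
    "qwen-0.5b direct 7k (served)": (78.6, 64.4, 82.0),
    "gpt-oss experts (stale control)": (67.6, 68.6, 81.8),
    "qwen-3b base": (62.1, 67.1, 80.5),
    "gpt-oss base": (55.4, 53.8, 68.2),
    "qwen-0.5b base": (36.5, 44.5, 67.9),
}

# Unweighted mean across the three judges, rounded like the table.
averages = {name: round(sum(s) / len(s), 1) for name, s in SCORES.items()}

# Rank models from highest to lowest average.
ranking = sorted(averages, key=averages.get, reverse=True)
for name in ranking:
    print(f"{name:34s} {averages[name]:.1f}")

# Compare the two served models against the stale control on the average.
control = averages["gpt-oss experts (stale control)"]
served = [name for name in SCORES if name.endswith("(served)")]
print(all(averages[name] > control for name in served))
```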

About ProofKit

ProofKit is a work-sample generator for job seekers: it turns a target role, background, and skills-to-prove into a realistic, clearly-fictional practice work sample (a role-specific challenge, a guided builder, a readiness review, and a recruiter-ready portfolio packet). Built for the Hugging Face Build Small Hackathon (Backyard AI track). Integrity rules are load-bearing: outputs never claim real employment, metrics are labeled hypothetical, and exports carry an ethical disclosure.

The ProofKit model family

Repo                                       What it is
visproj/proofkit-qwen0.5b-7k               Qwen2.5-0.5B fine-tuned directly on the 7k set (Transformers)
visproj/proofkit-gpt-oss-20b-lora          gpt-oss-20b LoRA, the distillation teacher
visproj/proofkit-distilled-qwen0.5b        Qwen2.5-0.5B distilled from the teacher (merged)
visproj/proofkit-distilled-qwen0.5b-gguf   GGUF of the distilled student (llama.cpp, served)
visproj/proofkit-sft                       SFT dataset (synthetic, license-safe)
visproj/proofkit-distill-qwen0.5b          Distillation dataset (teacher completions)

A note on training data (the "static responses" fix)

An earlier version of these models produced repetitive, input-ignoring drafts. The root cause was synthetic-data leakage: the dataset rendered the example user answers and the target from the same template slots, so the model learned target = template instead of target = f(input). The fix, faithfulness anchors (a distinctive token shared by the answer and the target) plus seeded per-example variation across every task, followed by a full-chain retrain, is what these current weights reflect.
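
The two data-side ingredients can be illustrated with a toy generator. This is only a sketch of the idea, not the ProofKit data pipeline: the role and skill lists, the anchor format, and the helper names `make_example` and `is_faithful` are all invented for this example.

```python
import random

ROLES = ["data analyst", "support engineer", "product designer"]  # illustrative only
SKILLS = ["SQL", "incident triage", "user research"]               # illustrative only


def make_example(index: int, base_seed: int = 0) -> dict:
    """Build one synthetic (answer, target) pair.

    A per-example seeded RNG gives reproducible variation, and the anchor
    token appears in both the user answer and the target, so the target
    cannot be produced from the template alone.
    """
    rng = random.Random(base_seed * 1_000_003 + index)
    role = rng.choice(ROLES)
    skill = rng.choice(SKILLS)
    anchor = f"ZX{rng.randrange(10_000):04d}"  # distinctive token (made-up format)
    answer = f"I am targeting a {role} role and want to prove {skill}. Project code: {anchor}."
    target = f"Practice challenge for a {role}: demonstrate {skill} on fictional project {anchor}."
    return {"answer": answer, "target": target, "anchor": anchor}


def is_faithful(example: dict) -> bool:
    """Check that the anchor appears in both the answer and the target."""
    return example["anchor"] in example["answer"] and example["anchor"] in example["target"]


examples = [make_example(i) for i in range(5)]
print(all(is_faithful(ex) for ex in examples))
```

A check like `is_faithful` could also be run over a generated dataset before training, to catch examples where the target does not depend on the input.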

Prompt format is a frozen contract

These 0.5B models were trained on the exact prompt shapes from ProofKit's prompt_formats.py. They only behave well when prompted in that format; reworded or free-form prompts push them off-distribution. They are purpose-built components of the ProofKit app, not general chat models.
