VibeThinker-Fable-Nano-Agentic-3B

A 3B agentic coding model: a QLoRA SFT of zkxxxx/VibeThinker-3B-heretic (Qwen2.5-Coder-3B lineage, abliterated) distilled from FABLE.5 agent traces. The goal was to turn a long-form math/reasoning model into one that follows a tool-use loop and commits to complete answers โ€” the behavior an autonomous coding agent needs.

Results

Scored HumanEval + MBPP pass@1 (vLLM, temp 0.2, identical harness across all rows):

model HumanEval MBPP
base (VibeThinker-3B-heretic) 0.220 0.234
this model (agentic FT, 3B) 0.342 0.506
Qwen2.5-Coder-3B-Instruct 0.823 0.755
Qwen2.5-Coder-7B-Instruct 0.884 0.856
Qwen2.5-Coder-14B-Instruct-AWQ 0.915 0.841

Two honest takeaways:

  1. The fine-tune works. Over its own base, HumanEval improves +56% relative and MBPP more than doubles (0.234 โ†’ 0.506). A local behavioral head-to-head corroborates this: the FT reliably follows the chat/tool format and finishes its code, where the base and two sibling finetunes ramble in raw chain-of-thought and get cut off.
  2. A purpose-built coder of the same size is far stronger at raw accuracy (Qwen2.5-Coder-3B-Instruct 0.82 vs 0.34 on HumanEval). This is expected: VibeThinker is a math-reasoning base, this FT targeted agentic behavior (not HumanEval-maxxing), and it spends part of its token budget on <think>. The Qwen 7B/14B figures match their published numbers, confirming the harness is sound.

The clearest next step is to apply the same agentic SFT recipe on top of a real coder (e.g. Qwen2.5-Coder-3B-Instruct): the recipe is proven, and the base model is the ceiling.

Training

  • Method: QLoRA SFT (vanilla HF Trainer), load_best_model_at_end=True, cosine LR.
  • Data: ~160k agentic coding turns distilled from Crownelius/Complete-FABLE.5-traces-2M (MIT) โ€” OpenAI-style messages with tool_calls/tool roles; <think> reasoning preserved; refusals filtered out.
  • Schedule: 2 epochs / 4930 steps. Best checkpoint = 4250 (eval_loss 1.0638); epoch-2 lowered distill-loss but did not improve HumanEval (ckpt-2500 and ckpt-4250 both score 0.311), so a 3rd epoch was judged unnecessary.
  • Hardware: single GCP L4.

Files

  • GGUF quants for local inference (llama.cpp): vibethinker-final-Q4_K_M.gguf (1.9 GB, recommended) and vibethinker-final-Q8_0.gguf (3.1 GB, near-lossless).
  • adapter/ โ€” the LoRA adapter (resumable / mergeable onto the base with peft).

The full fp16 merged Transformers model (and an f16 GGUF) can be produced from the adapter + base, or requested โ€” omitted here for size. To rebuild: merge adapter/ onto zkxxxx/VibeThinker-3B-heretic with peft.

Intended use & limitations

  • Intended: agentic coding assistants that drive a tool-call loop (shell/edit/test), small enough to run locally.
  • Not a frontier coder: 3B; absolute HumanEval is modest. Use for agent behavior + on-device cost, not SOTA accuracy.
  • โš ๏ธ Safety: the base is abliterated ("heretic") and training filtered refusals, so this model has reduced safety guardrails and will attempt requests a safety-aligned model would decline. Do not deploy in user-facing settings without your own guardrails. Authorized/research/local use is the intended context.

Where it fits

The value here is agentic behavior + small size + no guardrails, not raw coding accuracy. Good fits:

  • On-device / offline coding agent โ€” Q4_K_M is ~1.8 GB (laptop / 8 GB GPU). In a tool-loop with execution feedback (run tests, read errors, retry), the format-following matters more than one-shot pass@1 because the loop self-corrects. This is its strongest practical niche.
  • Authorized security / red-team research โ€” no refusals; won't decline edge tasks a safety-aligned model blocks (pentest tooling, CTF, adversarial test generation) in a controlled context.
  • Synthetic agent-trace generation โ€” small, fast, consistent tool_call structure; cheap to bootstrap more agentic SFT data or stress-test a tool harness.
  • Research baseline / ablations โ€” a reproducible demonstration that agentic behavior transfers via distillation SFT; useful for studying tool-use format adherence or base-vs-recipe effects.
  • Cheap planner/drafter in a two-tier setup โ€” the small agent proposes tool calls; a larger model verifies.

Not for: production code correctness (use a dedicated coder), user-facing deployment without your own guardrails, or non-English / non-coding tasks.

License

MIT โ€” both the base model zkxxxx/VibeThinker-3B-heretic and the training data Crownelius/Complete-FABLE.5-traces-2M are MIT-licensed. Note the upstream lineage (Qwen2.5-Coder-3B) carries Qwen's own model terms; verify those if your use is commercial.

Downloads last month
101
GGUF
Model size
3B params
Architecture
qwen2
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Nexlab/VibeThinker-Fable-Nano-Agentic-3B

Base model

Qwen/Qwen2.5-3B
Quantized
(4)
this model