Vinci

Vinci Piccolo 1.0 — GGUF

GGUF (quantized) builds of Vinci Piccolo, for local inference with Ollama, LM Studio, and llama.cpp. For the full-precision weights, evals, and details, see simpledirect/Vinci-Piccolo-1.0.

Available variants

File Size Min RAM Notes
vinci-piccolo-1.0-20260629-Q6_K.gguf 3.46 GB 12 GB Closest to BF16 quality
vinci-piccolo-1.0-20260629-Q5_K_M.gguf 3.07 GB 10 GB Good balance (recommended)
vinci-piccolo-1.0-20260629-Q4_K_M.gguf 2.71 GB 8 GB Smallest, tight memory budgets

GPU: Q5_K_M and Q4_K_M run on 4 GB VRAM; Q6_K needs 6 GB. Mac M-series: Q5_K_M fits on 8 GB unified memory; Q6_K needs 16 GB.

Ollama

ollama run hf.co/simpledirect/Vinci-Piccolo-1.0-GGUF

llama.cpp

./llama-cli \
    -m vinci-piccolo-1.0-20260629-Q5_K_M.gguf \
    --ctx-size 262144 \
    --temp 0 \
    --chat-template qwen3

llama-server (OpenAI-compatible API)

./llama-server \
    -m vinci-piccolo-1.0-20260629-Q5_K_M.gguf \
    --ctx-size 262144 \
    --host 0.0.0.0 \
    --port 8080

Prompt format

Qwen / ChatML chat template. No system prompt required — character is trained into the weights. Pass enable_thinking=False when using the tokenizer directly to suppress <think> output.

Citation

@misc{simpledirect2026vinci,
  title        = {Vinci Piccolo 1.0},
  author       = {{SimpleDirect}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/simpledirect/Vinci-Piccolo-1.0}},
  note         = {Apache 2.0. Fine-tuned from Qwen/Qwen3.5-4B.},
}
Downloads last month
-
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for simpledirect/Vinci-Piccolo-1.0-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(1)
this model

Collection including simpledirect/Vinci-Piccolo-1.0-GGUF