Supertron2.1-0.6B-GGUF

Supertron2.1-0.6B-GGUF contains GGUF exports of Surpem/Supertron2.1-0.6B, a compact Qwen3-based generalist model by Surpem.

This repository is for local inference with llama.cpp, LM Studio, Jan, KoboldCpp, text-generation-webui, and other GGUF-compatible runtimes. The original Transformers checkpoint is available at Surpem/Supertron2.1-0.6B.

Available Files

File Type Size Recommended Use
gguf/Supertron2.1-0.6B-F16.gguf F16 ~448 MiB Highest quality GGUF, larger memory use
gguf/Supertron2.1-0.6B-Q8_0.gguf 8-bit ~610 MiB Strong quality, efficient local use
gguf/Supertron2.1-0.6B-Q4_K_M.gguf 4-bit K-quants ~378 MiB Small, fast, best for low-memory devices

Which GGUF Should I Use?

Q4_K_M

Use this when you want the smallest practical model.

Good for:

  • laptops
  • CPU inference
  • fast testing
  • low VRAM
  • general chat

Tradeoff: slightly lower quality than Q8/F16.

Q8_0

Use this when you want better quality while keeping the file smaller than full precision.

Good for:

  • local coding help
  • math prompts
  • better instruction following
  • GPU offload with modest VRAM

Tradeoff: larger than Q4.

F16

Use this when quality matters most and memory is available.

Good for:

  • comparison testing
  • re-quantization
  • quality checks
  • development workflows

Tradeoff: largest runtime memory use.

llama.cpp Usage

Install or build llama.cpp, then run:

llama-cli \
  -m gguf/Supertron2.1-0.6B-Q4_K_M.gguf \
  -p "Write a Python function that returns the nth Fibonacci number." \
  -n 256

For chat-style prompting:

llama-cli \
  -m gguf/Supertron2.1-0.6B-Q8_0.gguf \
  -cnv \
  --color \
  -p "You are Supertron, a helpful coding and math assistant."

With GPU offload:

llama-cli \
  -m gguf/Supertron2.1-0.6B-Q4_K_M.gguf \
  -ngl 99 \
  -p "Explain binary search in simple terms." \
  -n 300

llama-server

llama-server \
  -m gguf/Supertron2.1-0.6B-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080

Then call it with an OpenAI-compatible client.

Ollama Modelfile

Create a file named Modelfile:

FROM ./gguf/Supertron2.1-0.6B-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER num_ctx 4096

SYSTEM """
You are Supertron, a helpful assistant focused on math, coding, and general knowledge.
"""

Create and run:

ollama create supertron2.1-0.6b -f Modelfile
ollama run supertron2.1-0.6b

Recommended Settings

For coding and math:

temperature: 0.2
top_p: 0.8
top_k: 20
repeat_penalty: 1.05

For chat:

temperature: 0.7
top_p: 0.8
top_k: 20
repeat_penalty: 1.05

For deterministic answers:

temperature: 0.0

Model Line

  • Original model: Surpem/Supertron2.1-0.6B
  • GGUF model: Surpem/Supertron2.1-0.6B-GGUF
  • MLX 4-bit: Surpem/Supertron2.1-0.6B-MLX-4Bit
  • MLX 8-bit: Surpem/Supertron2.1-0.6B-MLX-8Bit

Notes

The GGUF files were converted from the latest Supertron2.1-0.6B Transformers checkpoint using llama.cpp tooling. Quantized models are approximations of the original bf16 checkpoint, and behavior can vary by runtime, prompt format, and sampling settings.

Limitations

  • Q4 is smaller but less precise than Q8/F16.
  • The model can hallucinate or produce wrong code.
  • Human review is recommended for math, code, and factual claims.
  • Do not use this model for safety-critical decisions.

License

Apache 2.0.

Downloads last month
88
GGUF
Model size
0.6B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Surpem/Supertron2.1-0.6B-GGUF

Finetuned
Qwen/Qwen3-0.6B
Quantized
(1)
this model

Collection including Surpem/Supertron2.1-0.6B-GGUF