Grug-12B VLM MLX

Apple Silicon MLX VLM quantizations of kai-os/Grug-12B, packaged as a single Hugging Face repo with one folder per quantization level.

Grug-12B is a compact-reasoning fine-tune of google/gemma-4-12B-it. The source model was released as merged Transformers/safetensors weights after QLoRA training. This repo provides only MLX-quantized derivatives for Apple Silicon inference; they keep the original vision-language model structure.

Highlights

Vision-language support is preserved through the Gemma 4 unified VLM config.
Three MLX affine quantizations are available in one repo: 8-bit, 6-bit, and 4-bit.
Benchmarked with oMLX using the mlx-lm engine; screenshots are included below.
The original BF16 Transformers weights remain in the source repo.

Available variants

Variant	Folder	Quantization	Size	Best fit
MLX 8-bit	`mlx-8bit/`	affine, group size 64	12 GB	Highest-quality local MLX run.
MLX 6-bit	`mlx-6bit/`	affine, group size 64	9.1 GB	Balanced quality, memory, and speed.
MLX 4-bit	`mlx-4bit/`	affine, group size 64	6.3 GB	Smallest footprint and lowest peak memory.

These are not GGUF files and are not llama.cpp quants. They are MLX safetensors folders intended for mlx-vlm.
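The folder sizes in the table can be roughly sanity-checked with a little arithmetic. The sketch below assumes about 12 billion quantized parameters (inferred from the "12B" in the model name, not stated in this card) and that each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the packed values. The function names are illustrative, and any layers left unquantized are ignored, so treat the result as an estimate only.

```python
# Rough size estimate for affine group quantization.
# Assumptions (not from the model card): ~12e9 quantized parameters,
# and one 16-bit scale plus one 16-bit bias stored per group of weights.

def bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Effective bits per weight, including the per-group scale and bias."""
    return bits + overhead_bits / group_size


def estimated_size_gib(n_params: float, bits: int, group_size: int = 64) -> float:
    """Estimated weight size in GiB (2**30 bytes)."""
    total_bytes = n_params * bits_per_weight(bits, group_size) / 8
    return total_bytes / 2**30


N_PARAMS = 12e9  # assumption based on the model name

for bits in (8, 6, 4):
    print(
        f"{bits}-bit: {bits_per_weight(bits):.2f} bits/weight, "
        f"~{estimated_size_gib(N_PARAMS, bits):.1f} GiB"
    )
```

Under these assumptions the estimates land within about 10% of the listed folder sizes, which is consistent with the sizes being dominated by the quantized weights.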

Benchmarks

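Benchmarks were run with oMLX with the "Force mlx-lm engine" setting. Each run used prompt prefill sizes of 1024, 4096, and 8192 tokens (pp1024, pp4096, pp8192) with 128 generated tokens (tg128). In the tables, "tg TPS" is generation throughput in tokens per second and "E2E" is the end-to-end time for the run. Values below are copied from the captured benchmark output.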

Hardware: Apple Mac Studio with M4 Max and 64 GB unified memory.

Variant	pp1024 tg TPS	pp4096 tg TPS	pp8192 tg TPS	pp8192 E2E	Peak mem
`mlx-8bit`	30.3 tok/s	20.4 tok/s	31.6 tok/s	20.189 s	13.80 GB
`mlx-6bit`	38.9 tok/s	38.7 tok/s	37.8 tok/s	19.795 s	11.03 GB
`mlx-4bit`	21.7 tok/s	15.7 tok/s	50.9 tok/s	18.540 s	8.26 GB
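One way to use the peak-memory column is to pick the highest-precision variant that fits a given memory budget. The sketch below is a hypothetical helper, not part of this repo: `pick_variant` and the default `headroom_gb` of 2.0 are assumptions for illustration, and the peak figures are the ones measured in the benchmark above, so longer contexts or other workloads may need more.

```python
from typing import Optional

# Peak memory in GB, copied from the benchmark table above.
# Ordered from highest to lowest precision.
PEAK_MEM_GB = {
    "mlx-8bit": 13.80,
    "mlx-6bit": 11.03,
    "mlx-4bit": 8.26,
}


def pick_variant(budget_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-precision variant whose measured peak memory
    plus headroom fits within budget_gb, or None if nothing fits."""
    for name, peak in PEAK_MEM_GB.items():
        if peak + headroom_gb <= budget_gb:
            return name
    return None


for budget in (16.0, 14.0, 12.0, 8.0):
    print(f"{budget:>5.1f} GB budget -> {pick_variant(budget)}")
```

The headroom value is a guess meant to leave space for the operating system and other applications; adjust it to match the machine.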

Continuous batching at pp1024 / tg128:

Variant	Batch 1 tg TPS	Batch 2 tg TPS	Batch 2 speedup
`mlx-8bit`	30.3 tok/s	34.2 tok/s	1.13x
`mlx-6bit`	38.9 tok/s	40.5 tok/s	1.04x
`mlx-4bit`	21.7 tok/s	56.1 tok/s	2.59x
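The speedup column appears to be the ratio of the batch 2 throughput to the batch 1 throughput. A minimal sketch that recomputes it from the table values (the helper name `batch_speedup` is illustrative):

```python
# Generation throughput in tok/s, copied from the batching table above:
# (batch 1, batch 2) per variant.
BATCH_TPS = {
    "mlx-8bit": (30.3, 34.2),
    "mlx-6bit": (38.9, 40.5),
    "mlx-4bit": (21.7, 56.1),
}


def batch_speedup(batch1_tps: float, batch2_tps: float) -> float:
    """Throughput ratio of the batch 2 run relative to the batch 1 run."""
    return batch2_tps / batch1_tps


for name, (b1, b2) in BATCH_TPS.items():
    print(f"{name}: {batch_speedup(b1, b2):.2f}x")
```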

Benchmark screenshots

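[Screenshot: oMLX benchmark output, MLX 8-bit]

[Screenshot: oMLX benchmark output, MLX 6-bit]

[Screenshot: oMLX benchmark output, MLX 4-bit]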

Usage

Download only the variant you want:

from pathlib import Path
from huggingface_hub import snapshot_download

repo_id = "chanderbalaji/Grug-12B-VLM-MLX"
variant = "mlx-4bit"

snapshot = snapshot_download(
    repo_id,
    allow_patterns=[f"{variant}/*"],
)
model_path = Path(snapshot) / variant
print(model_path)
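To fetch more than one variant in a single call, the same `allow_patterns` idea extends to a list of folders. The helper below is a hypothetical convenience, not part of this repo; it only builds and validates the pattern list, using the folder names from the variants table.

```python
# Folder names taken from the "Available variants" table.
KNOWN_VARIANTS = ("mlx-8bit", "mlx-6bit", "mlx-4bit")


def variant_patterns(variants):
    """Build allow_patterns entries for one or more variant folders."""
    unknown = [v for v in variants if v not in KNOWN_VARIANTS]
    if unknown:
        raise ValueError(f"Unknown variant(s): {unknown}")
    return [f"{v}/*" for v in variants]


patterns = variant_patterns(["mlx-6bit", "mlx-4bit"])
print(patterns)
# Pass the result to snapshot_download(repo_id, allow_patterns=patterns).
```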

Run with mlx-vlm:

python -m mlx_vlm.generate \
  --model /path/to/downloaded/snapshot/mlx-4bit \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  --max-tokens 256

For text-only prompts, omit the --image argument.
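When scripting runs, the command above can be assembled in Python so that the image argument is added only when an image is supplied. This is a minimal sketch that uses only the flags shown in this card; `build_generate_command` is an illustrative name, and it assumes `python` on the PATH is the interpreter with mlx-vlm installed.

```python
import shlex
from typing import List, Optional


def build_generate_command(
    model_path: str,
    prompt: str,
    image: Optional[str] = None,
    max_tokens: int = 256,
) -> List[str]:
    """Assemble the mlx_vlm.generate command line as an argument list."""
    cmd = [
        "python", "-m", "mlx_vlm.generate",
        "--model", model_path,
        "--prompt", prompt,
    ]
    if image is not None:
        # Only vision prompts carry an image; text-only prompts omit it.
        cmd += ["--image", image]
    cmd += ["--max-tokens", str(max_tokens)]
    return cmd


cmd = build_generate_command(
    "/path/to/downloaded/snapshot/mlx-4bit",
    "Summarize MLX in one sentence.",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```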

Provenance and attribution

Source model: kai-os/Grug-12B
Base model: google/gemma-4-12B-it
Relationship: MLX quantized derivatives of the source model
Source revision used locally: ad3feab42542e3361dcaf0ebe795d55009765918
Conversion target: Gemma 4 unified VLM with vision_config preserved

The source model card describes the original training recipe, datasets, local evaluation, limitations, and acknowledgements. Please refer to that card for the full model provenance and license context.

Limitations

Quantization can change output quality, numerical behavior, and edge-case performance. These files are intended for local MLX inference on Apple Silicon. Use the source model repo for the original BF16 Transformers weights.

