Grug-12B VLM MLX

Apple Silicon MLX VLM quantizations of kai-os/Grug-12B, packaged as a single Hugging Face repo with one folder per quantization level.

Grug-12B is a compact-reasoning fine-tune of google/gemma-4-12B-it. The source model was released as merged Transformers/safetensors weights after QLoRA training. This repo provides only MLX-quantized derivatives for Apple Silicon inference and preserves the original vision-language model structure.

Highlights

  • Vision-language support is preserved through the Gemma 4 unified VLM config.
  • Three MLX affine quantizations are available in one repo: 8-bit, 6-bit, and 4-bit.
  • Benchmarked with oMLX using the mlx-lm engine; screenshots are included below.
  • The original BF16 Transformers weights remain in the source repo.

Available variants

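| Variant   | Folder    | Quantization          | Size   | Best fit                                  |
|-----------|-----------|-----------------------|--------|-------------------------------------------|
| MLX 8-bit | mlx-8bit/ | affine, group size 64 | 12 GB  | Highest-quality local MLX run.            |
| MLX 6-bit | mlx-6bit/ | affine, group size 64 | 9.1 GB | Balanced quality, memory, and speed.      |
| MLX 4-bit | mlx-4bit/ | affine, group size 64 | 6.3 GB | Smallest footprint and lowest peak memory. |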

These are not GGUF files and are not llama.cpp quants. They are MLX safetensors folders intended for mlx-vlm.
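
To make "affine, group size 64" concrete, here is a minimal pure-Python sketch of the general idea: each group of weights is mapped to integer codes using one scale and one offset per group. This illustrates the technique only and is not the MLX kernel. The function names are hypothetical, and the 32 bits of per-group metadata (one 16-bit scale plus one 16-bit bias) are an assumption.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    # Fall back to a scale of 1.0 when every weight in the group is identical.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + bias for c in codes]


def effective_bits_per_weight(bits, group_size=64, metadata_bits=32):
    """Code bits plus per-group metadata spread across the group (assumed layout)."""
    return bits + metadata_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]  # one illustrative group of 64 weights
codes, scale, bias = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, bias)
worst_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"worst round-trip error: {worst_error:.4f}, step size: {scale:.4f}")
print(f"effective bits per weight at 4-bit: {effective_bits_per_weight(4)}")
```

Under these assumptions the per-group metadata adds half a bit per weight. The on-disk sizes in the table may also include layers that were left unquantized, so treat this as a rough guide rather than an exact size formula.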

Benchmarks

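Benchmarks were run with oMLX with the mlx-lm engine forced. Each run used prompt prefill sizes of 1024, 4096, and 8192 tokens, each followed by 128 generated tokens. In the tables, ppN denotes a prompt of N tokens, tg TPS is token-generation throughput in tokens per second, and E2E is end-to-end time. The values below are copied from the captured benchmark output.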

Hardware: Apple Mac Studio with M4 Max and 64 GB unified memory.

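| Variant  | pp1024 tg TPS | pp4096 tg TPS | pp8192 tg TPS | pp8192 E2E | Peak mem |
|----------|---------------|---------------|---------------|------------|----------|
| mlx-8bit | 30.3 tok/s    | 20.4 tok/s    | 31.6 tok/s    | 20.189 s   | 13.80 GB |
| mlx-6bit | 38.9 tok/s    | 38.7 tok/s    | 37.8 tok/s    | 19.795 s   | 11.03 GB |
| mlx-4bit | 21.7 tok/s    | 15.7 tok/s    | 50.9 tok/s    | 18.540 s   | 8.26 GB  |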
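
The table reports generation throughput and end-to-end time but not prefill throughput. The sketch below derives a rough prefill estimate from the pp8192 rows. It assumes that end-to-end time is simply prefill time plus generation time, which the benchmark output does not confirm, and the helper name `implied_prefill_tps` is hypothetical.

```python
def implied_prefill_tps(prompt_tokens, gen_tokens, gen_tps, e2e_seconds):
    """Estimate prefill throughput, assuming E2E = prefill time + generation time."""
    generation_seconds = gen_tokens / gen_tps
    prefill_seconds = e2e_seconds - generation_seconds
    if prefill_seconds <= 0:
        raise ValueError("generation time exceeds the reported end-to-end time")
    return prompt_tokens / prefill_seconds


# pp8192 rows copied from the table above: (tg TPS, E2E seconds).
pp8192_rows = {
    "mlx-8bit": (31.6, 20.189),
    "mlx-6bit": (37.8, 19.795),
    "mlx-4bit": (50.9, 18.540),
}

for variant, (tg_tps, e2e) in pp8192_rows.items():
    estimate = implied_prefill_tps(8192, 128, tg_tps, e2e)
    print(f"{variant}: ~{estimate:.0f} prefill tok/s (derived estimate)")
```

These figures are derived, not measured, so they are only as good as the additive-time assumption.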

Continuous batching at pp1024 / tg128:

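| Variant  | Batch 1 tg TPS | Batch 2 tg TPS | Batch 2 speedup |
|----------|----------------|----------------|-----------------|
| mlx-8bit | 30.3 tok/s     | 34.2 tok/s     | 1.13x           |
| mlx-6bit | 38.9 tok/s     | 40.5 tok/s     | 1.04x           |
| mlx-4bit | 21.7 tok/s     | 56.1 tok/s     | 2.59x           |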
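
The speedup column can be reproduced from the two throughput columns. This short sketch assumes the Batch 2 figure is aggregate throughput across both requests and that speedup is the plain ratio of the two columns.

```python
# (Batch 1 tg TPS, Batch 2 tg TPS) copied from the table above.
batching_rows = {
    "mlx-8bit": (30.3, 34.2),
    "mlx-6bit": (38.9, 40.5),
    "mlx-4bit": (21.7, 56.1),
}


def batch_speedup(batch1_tps, batch2_tps):
    """Ratio of batch-2 throughput to batch-1 throughput."""
    return batch2_tps / batch1_tps


speedups = {
    name: round(batch_speedup(b1, b2), 2)
    for name, (b1, b2) in batching_rows.items()
}
print(speedups)
```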

Benchmark screenshots

MLX 8-bit

[Screenshot: oMLX benchmark for Grug-12B VLM 8-bit]

MLX 6-bit

[Screenshot: oMLX benchmark for Grug-12B VLM 6-bit]

MLX 4-bit

[Screenshot: oMLX benchmark for Grug-12B VLM 4-bit]

Usage

Download only the variant you want:

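from pathlib import Path
from huggingface_hub import snapshot_download

repo_id = "chanderbalaji/Grug-12B-VLM-MLX"
variant = "mlx-4bit"  # or "mlx-6bit" / "mlx-8bit"

# Fetch only the files under the chosen variant folder.
snapshot = snapshot_download(
    repo_id,
    allow_patterns=[f"{variant}/*"],
)
# The model files live in the variant subfolder of the local snapshot.
model_path = Path(snapshot) / variant
print(model_path)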
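
After downloading, it can help to confirm the folder looks complete before pointing mlx-vlm at it. The sketch below uses a hypothetical helper, `check_variant_folder`. It assumes the folder should contain a `config.json` and at least one `.safetensors` file, as is typical for MLX model folders, and it does not check every file the loader may need.

```python
from pathlib import Path


def check_variant_folder(model_path):
    """Return a list of problems found in a downloaded variant folder (empty if none)."""
    model_path = Path(model_path)
    if not model_path.is_dir():
        return [f"{model_path} is not a directory"]
    problems = []
    if not (model_path / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(model_path.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems
```

Calling `check_variant_folder(model_path)` on the path printed above should return an empty list if both checks pass.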

Run with mlx-vlm:

python -m mlx_vlm.generate \
  --model /path/to/downloaded/snapshot/mlx-4bit \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  --max-tokens 256

For text-only prompts, omit the --image argument.
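
To script the same command from Python, one option is to build the argument list and call it with `subprocess`. This is a minimal sketch that uses only the CLI flags shown above. The helper names `build_generate_command` and `run_generate` are hypothetical, and the sketch assumes mlx-vlm is installed in the current Python environment.

```python
import subprocess
import sys


def build_generate_command(model_path, prompt, image=None, max_tokens=256):
    """Build the mlx_vlm.generate command line; --image is omitted for text-only prompts."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", str(model_path),
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]
    if image is not None:
        cmd += ["--image", str(image)]
    return cmd


def run_generate(model_path, prompt, image=None, max_tokens=256):
    """Run the command and return its standard output (raises if the command fails)."""
    cmd = build_generate_command(model_path, prompt, image, max_tokens)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_generate(model_path, "Describe this image.", image="/path/to/image.jpg")` mirrors the shell command above, and leaving out `image` gives the text-only form.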

Provenance and attribution

  • Source model: kai-os/Grug-12B
  • Base model: google/gemma-4-12B-it
  • Relationship: MLX quantized derivatives of the source model
  • Source revision used locally: ad3feab42542e3361dcaf0ebe795d55009765918
  • Conversion target: Gemma 4 unified VLM with vision_config preserved

The source model card describes the original training recipe, datasets, local evaluation, limitations, and acknowledgements. Please refer to that card for the full model provenance and license context.

Limitations

Quantization can change output quality, numerical behavior, and edge-case performance. These files are intended for local MLX inference on Apple Silicon. Use the source model repo for the original BF16 Transformers weights.
