Grug-12B VLM MLX
Apple Silicon MLX VLM quantizations of
kai-os/Grug-12B, packaged as a
single Hugging Face repo with one folder per quantization level.
Grug-12B is a compact-reasoning fine-tune of
google/gemma-4-12B-it. The
source model was released as merged Transformers/safetensors weights after
QLoRA training. This repo only provides MLX quantized derivatives for Apple
Silicon inference and keeps the original vision-language model structure.
Highlights
- Vision-language support is preserved through the Gemma 4 unified VLM config.
- Three MLX affine quantizations are available in one repo: 8-bit, 6-bit, and 4-bit.
- Benchmarked with oMLX on the MLX LM engine; screenshots are included below.
- The original BF16 Transformers weights remain in the source repo.
Available variants
| Variant | Folder | Quantization | Size | Best fit |
|---|---|---|---|---|
| MLX 8-bit | mlx-8bit/ | affine, group size 64 | 12 GB | Highest-quality local MLX run. |
| MLX 6-bit | mlx-6bit/ | affine, group size 64 | 9.1 GB | Balanced quality, memory, and speed. |
| MLX 4-bit | mlx-4bit/ | affine, group size 64 | 6.3 GB | Smallest footprint and best peak memory. |
These are not GGUF files and are not llama.cpp quants. They are MLX safetensors
folders intended for mlx-vlm.
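To make "affine, group size 64" concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each group of 64 weights is stored as integer codes plus one scale and one bias. The function names and the random test weights are illustrative assumptions. This is not MLX's implementation, which also packs the codes and runs on the GPU.

```python
import random

def affine_quantize(group, bits):
    """Map one group of floats to `bits`-bit integer codes plus a scale and bias."""
    levels = (1 << bits) - 1            # e.g. 15 for 4-bit, 255 for 8-bit
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in group]
    return codes, scale, lo             # the bias is the group minimum

def affine_dequantize(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale, and bias."""
    return [c * scale + bias for c in codes]

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(64)]  # one hypothetical group of 64 weights

max_errors = {}
for bits in (8, 6, 4):
    codes, scale, bias = affine_quantize(group, bits)
    restored = affine_dequantize(codes, scale, bias)
    max_errors[bits] = max(abs(a - b) for a, b in zip(group, restored))
    print(f"{bits}-bit: max abs reconstruction error {max_errors[bits]:.6f}")
```

The reconstruction error is bounded by half the scale, so fewer bits mean a coarser grid within each group. That is the quality trade-off behind the three folders.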
Benchmarks
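Benchmarks were run with oMLX, using the "Force mlx-lm engine" setting. Each
run used prompt prefill sizes of 1024, 4096, and 8192 tokens with 128
generated tokens. In the tables, ppN denotes a prompt of N tokens, tg TPS is
token-generation throughput in tokens per second, and E2E is the end-to-end
time for the run. Values below are copied from the captured benchmark output.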
Hardware: Apple Mac Studio with M4 Max and 64 GB unified memory.
| Variant | pp1024 tg TPS | pp4096 tg TPS | pp8192 tg TPS | pp8192 E2E | Peak mem |
|---|---|---|---|---|---|
| mlx-8bit | 30.3 tok/s | 20.4 tok/s | 31.6 tok/s | 20.189 s | 13.80 GB |
| mlx-6bit | 38.9 tok/s | 38.7 tok/s | 37.8 tok/s | 19.795 s | 11.03 GB |
| mlx-4bit | 21.7 tok/s | 15.7 tok/s | 50.9 tok/s | 18.540 s | 8.26 GB |
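The table reports generation speed and end-to-end time but no prefill speed. As a rough sketch, prefill throughput at pp8192 can be estimated by assuming the end-to-end time is simply prefill time plus generation time. That decomposition is an assumption, not something the benchmark output states, so treat the result as an estimate only.

```python
# (generation tok/s at pp8192, end-to-end seconds at pp8192), copied from the table
RUNS = {
    "mlx-8bit": (31.6, 20.189),
    "mlx-6bit": (37.8, 19.795),
    "mlx-4bit": (50.9, 18.540),
}
PROMPT_TOKENS = 8192
GENERATED_TOKENS = 128

def estimate_prefill_tps(tg_tps, e2e_seconds):
    """Estimate prompt tok/s, assuming E2E = prefill time + generation time."""
    generation_seconds = GENERATED_TOKENS / tg_tps
    prefill_seconds = e2e_seconds - generation_seconds
    return PROMPT_TOKENS / prefill_seconds

prefill = {name: estimate_prefill_tps(*values) for name, values in RUNS.items()}
for name, tps in prefill.items():
    print(f"{name}: ~{tps:.0f} prompt tok/s (estimated)")
```

Under that assumption, all three variants land close to 500 prompt tokens per second. This would mean the end-to-end gap between variants at this prompt size comes mostly from generation speed.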
Continuous batching at pp1024 / tg128:
| Variant | Batch 1 tg TPS | Batch 2 tg TPS | Batch 2 speedup |
|---|---|---|---|
| mlx-8bit | 30.3 tok/s | 34.2 tok/s | 1.13x |
| mlx-6bit | 38.9 tok/s | 40.5 tok/s | 1.04x |
| mlx-4bit | 21.7 tok/s | 56.1 tok/s | 2.59x |
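The speedup column is the ratio of the Batch 2 rate to the Batch 1 rate. A short check that reproduces it from the two throughput columns (the dictionary name is illustrative):

```python
# (batch 1 tok/s, batch 2 tok/s) at pp1024 / tg128, copied from the table
BATCHING = {
    "mlx-8bit": (30.3, 34.2),
    "mlx-6bit": (38.9, 40.5),
    "mlx-4bit": (21.7, 56.1),
}

# Speedup = batch-2 generation rate divided by batch-1 generation rate
speedups = {name: round(b2 / b1, 2) for name, (b1, b2) in BATCHING.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```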
Usage
Download only the variant you want:
from pathlib import Path
from huggingface_hub import snapshot_download
repo_id = "chanderbalaji/Grug-12B-VLM-MLX"
variant = "mlx-4bit"
snapshot = snapshot_download(
    repo_id,
    allow_patterns=[f"{variant}/*"],
)
model_path = Path(snapshot) / variant
print(model_path)
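If you are unsure which folder to download, one option is to choose by memory budget. The sketch below picks the largest variant whose benchmarked peak memory fits a given budget and builds the matching `allow_patterns` value. The helper names are hypothetical. The peak-memory figures are the pp8192 measurements from the benchmark table, so longer contexts or image inputs may need more headroom.

```python
# Peak memory in GB at pp8192, copied from the benchmark table
PEAK_MEM_GB = {"mlx-8bit": 13.80, "mlx-6bit": 11.03, "mlx-4bit": 8.26}

def pick_variant(budget_gb):
    """Return the largest variant whose measured peak memory fits the budget."""
    fitting = [name for name, mem in PEAK_MEM_GB.items() if mem <= budget_gb]
    if not fitting:
        raise ValueError(f"No variant fits within {budget_gb} GB")
    return max(fitting, key=PEAK_MEM_GB.get)

def allow_patterns_for(variant):
    """Build the allow_patterns list that restricts a download to one folder."""
    return [f"{variant}/*"]

variant = pick_variant(12.0)  # hypothetical 12 GB budget
print(variant, allow_patterns_for(variant))
```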
Run with mlx-vlm:
python -m mlx_vlm.generate \
    --model /path/to/downloaded/snapshot/mlx-4bit \
    --prompt "Describe this image." \
    --image /path/to/image.jpg \
    --max-tokens 256
For text-only prompts, omit the --image argument.
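For scripting, the same command line can be assembled in Python. This is a small sketch that uses only the flags shown above and leaves out `--image` for text-only prompts. The helper names are hypothetical, and running the command requires mlx-vlm installed on an Apple Silicon machine.

```python
import subprocess
import sys

def build_generate_command(model_path, prompt, image=None, max_tokens=256):
    """Build the argv list for `python -m mlx_vlm.generate`."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", str(model_path),
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]
    if image is not None:
        # Only vision prompts carry --image; text-only prompts omit it
        cmd += ["--image", str(image)]
    return cmd

def run_generate(model_path, prompt, image=None, max_tokens=256):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    cmd = build_generate_command(model_path, prompt, image, max_tokens)
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the text-only command instead of running it
    print(" ".join(build_generate_command(
        "/path/to/downloaded/snapshot/mlx-4bit", "Summarize MLX in one sentence.")))
```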
Provenance and attribution
- Source model: kai-os/Grug-12B
- Base model: google/gemma-4-12B-it
- Relationship: MLX quantized derivatives of the source model
- Source revision used locally: ad3feab42542e3361dcaf0ebe795d55009765918
- Conversion target: Gemma 4 unified VLM with vision_config preserved
The source model card describes the original training recipe, datasets, local evaluation, limitations, and acknowledgements. Please refer to that card for the full model provenance and license context.
Limitations
Quantization can change output quality, numerical behavior, and edge-case performance. These files are intended for local MLX inference on Apple Silicon. Use the source model repo for the original BF16 Transformers weights.