ToPo-ToPo/gemma-4-26B-A4B-it-mlx-8bit

MLX 8-bit conversion of google/gemma-4-26b-a4b-it (Gemma 4 26B-A4B, a mixture-of-experts model with 128 experts), made with mlx-vlm 0.6.3.

Provenance (self-converted)

  • Source: google/gemma-4-26b-a4b-it (license: gemma)
  • Quantization: 8-bit (group_size=64, ~8.67 bpw)
  • Tool: mlx_vlm.convert from mlx-vlm 0.6.3 (the source checkpoint already stores the experts fused, so no patch was needed)
  • Validated: loads and translates correctly under mlx-vlm 0.6.3.
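
The bits-per-weight figure above can be sanity-checked with a little arithmetic. The sketch below assumes group-wise affine quantization in which every group of 64 weights carries one 16-bit scale and one 16-bit bias; that layout is an assumption about the storage format, not something stated in this card. The function names are hypothetical helpers written for illustration.

```python
def affine_bits_per_weight(bits: int, group_size: int,
                           scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Storage cost per weight when each group adds one scale and one bias.

    Assumption: scale and bias are stored as 16-bit values per group.
    """
    return bits + (scale_bits + bias_bits) / group_size


def unquantized_fraction(overall_bpw: float, quantized_bpw: float,
                         full_bits: int = 16) -> float:
    """Share of parameters that would have to stay at `full_bits`
    for the model-wide average to reach `overall_bpw`."""
    return (overall_bpw - quantized_bpw) / (full_bits - quantized_bpw)


quantized_bpw = affine_bits_per_weight(bits=8, group_size=64)
print(quantized_bpw)  # 8.5

# Hypothetical explanation for the gap between 8.5 and the reported ~8.67:
# a small share of tensors kept at 16 bits.
frac = unquantized_fraction(overall_bpw=8.67, quantized_bpw=quantized_bpw)
print(f"{frac:.1%}")  # 2.3%
```

Under these assumptions the quantized tensors cost 8.5 bits per weight, so the reported ~8.67 bpw would be consistent with roughly 2% of the parameters remaining unquantized. Which tensors those are is not stated here.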

Usage

from mlx_vlm import load
model, processor = load("ToPo-ToPo/gemma-4-26B-A4B-it-mlx-8bit")

License

This model is a derivative of Google Gemma, converted to MLX. It is governed by the Gemma Terms of Use (https://ai.google.dev/gemma/terms) and the Gemma Prohibited Use Policy.

⚡ Faster generation with MTP (speculative decoding, lossless)

Recommended drafter: google/gemma-4-26B-A4B-it-assistant, Google's official MTP (multi-token prediction) drafter for this model. It loads directly in mlx-vlm with no conversion and can give up to ~3x faster generation; the speedup measured on short prompts was ≈1.4–1.5x. Output is identical to non-MTP decoding.

# requires:  pip install "mlx-vlm>=0.6.3"
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the 8-bit target model and Google's MTP drafter.
model, processor = load("ToPo-ToPo/gemma-4-26B-A4B-it-mlx-8bit")
draft_model, _   = load("google/gemma-4-26B-A4B-it-assistant")
config = load_config("ToPo-ToPo/gemma-4-26B-A4B-it-mlx-8bit")

# Build a text-only chat prompt (no images).
prompt = apply_chat_template(processor, config, "Hello!", num_images=0)
# draft_kind="mtp" must be passed explicitly in the Python API.
out = generate(model, processor, prompt,
               draft_model=draft_model, draft_kind="mtp", max_tokens=256)

CLI (draft_kind is auto-detected):

mlx_vlm.generate --model ToPo-ToPo/gemma-4-26B-A4B-it-mlx-8bit --draft-model google/gemma-4-26B-A4B-it-assistant

Notes

  • draft_kind="mtp" is required in the Python API (the CLI auto-detects it).
  • Use the drafter listed above for this model; drafters are size-specific and not interchangeable across Gemma 4 variants.
  • Needs mlx-vlm >= 0.6.3. MTP is lossless, so if the output differs from non-MTP decoding, your versions are mismatched.
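
To see why speculative decoding can be lossless, here is a toy sketch of the draft-and-verify loop under greedy decoding. It is not mlx-vlm's implementation: `target` and `draft` are hypothetical stand-in functions that map a token sequence to the next token, and the verification is written as a loop where a real system would use one batched forward pass of the target model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_tokens: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_tokens):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_tokens: int, block: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq, out = list(prompt), []
    while len(out) < max_tokens:
        # 1) The cheap draft model proposes a block of tokens.
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each position; every kept token is the target's
        #    own choice, so the draft can only affect speed, never content.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)
            if expected != tok:
                break  # first mismatch: keep the target's token, drop the rest
        seq.extend(accepted)
        out.extend(accepted)
    return out[:max_tokens]


# Toy stand-ins: the draft agrees with the target except at some positions.
def target(seq: List[int]) -> int:
    return (7 * seq[-1] + len(seq)) % 50


def draft(seq: List[int]) -> int:
    tok = target(seq)
    return tok if len(seq) % 3 else (tok + 1) % 50


prompt = [1, 2, 3]
assert speculative_decode(target, draft, prompt, 20) == greedy_decode(target, prompt, 20)
```

Because a drafted token is kept only when it equals what the target would have produced anyway, the output matches plain greedy decoding no matter how good or bad the drafter is. With sampling instead of greedy decoding, implementations generally rely on an acceptance rule that preserves the target distribution rather than this exact-match check.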