Update

Added a Jinja chat template so the model can format conversations correctly and work smoothly with mlx-lm chat-style inference.

MLX 4-Bit Quantized: Gemma-4-12B-Coder

This repository contains a 4-bit MLX-converted version of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1.

The weights have been quantized to 4 bits, which cuts weight memory to roughly a quarter of a 16-bit checkpoint while aiming to preserve most of the original model's reasoning and coding ability. The model is intended for local inference on Apple Silicon Macs using the mlx-lm library.
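As a rough illustration of the memory savings, the sketch below estimates weight storage for 16-bit and 4-bit versions. It assumes 12 billion parameters (taken from the model name; the exact count may differ), a group size of 64, and one 16-bit scale plus one 16-bit bias per group. These quantization settings are assumptions, not values confirmed for this repository. The helper name `quantized_weight_bytes` is hypothetical.

```python
def quantized_weight_bytes(n_params, bits=4, group_size=64, overhead_bits_per_group=32):
    """Approximate storage for group-wise quantized weights.

    Each group of `group_size` weights is assumed to carry a 16-bit scale and
    a 16-bit bias (32 overhead bits) on top of the packed low-bit values.
    """
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8


# Assumption: "12B" from the model name is used as the parameter count.
n_params = 12e9

fp16_gb = n_params * 16 / 8 / 1e9          # 2 bytes per weight
q4_gb = quantized_weight_bytes(n_params) / 1e9

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{q4_gb:.2f} GB")
```

This counts weights only. The KV cache, activations, and any layers left unquantized add to the real footprint, so treat the numbers as a lower bound.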

How to Use with MLX

Install the required dependency:

pip install --upgrade mlx-lm

Run inference from Python:

from mlx_lm import load, generate

# Load the 4-bit quantized MLX model.
model, tokenizer = load("mlx-community/gemma-4-12b-coder-fable5-composer2.5-4bit")

prompt = "Write a Python script to sort a dictionary by its values."
messages = [{"role": "user", "content": prompt}]

formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a reply. With no sampler argument, recent mlx-lm releases decode
# greedily, so no separate temperature setting is needed for deterministic output.
response = generate(
    model,
    tokenizer,
    prompt=formatted_prompt,
    verbose=True,
    max_tokens=1024,
)
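For multi-turn chat, the full message history has to be re-rendered through the chat template on every turn. The sketch below shows one way to keep that history. The helper `chat_turn` and its `render` and `complete` parameters are hypothetical names, and it assumes the bundled template accepts the usual `user` and `assistant` roles.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    history: List[Message],
    user_text: str,
    render: Callable[[List[Message]], str],
    complete: Callable[[str], str],
) -> str:
    """Run one chat turn and record both sides of it in `history`.

    `render` turns the message list into a prompt string and `complete`
    turns a prompt string into the model's reply.
    """
    history.append({"role": "user", "content": user_text})
    reply = complete(render(history))
    history.append({"role": "assistant", "content": reply})
    return reply


# Possible wiring with the objects from the example above (not executed here):
#   render = lambda msgs: tokenizer.apply_chat_template(
#       msgs, tokenize=False, add_generation_prompt=True)
#   complete = lambda p: generate(model, tokenizer, prompt=p, max_tokens=1024)
#   history = []
#   chat_turn(history, "Write a function that reverses a string.", render, complete)
#   chat_turn(history, "Now add type hints.", render, complete)
```

Passing the two callables in keeps the history logic independent of mlx-lm, so it can be exercised with stubs before loading the model.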

Base and License

Free to use, modify, and redistribute under the Apache 2.0 license.
