Instructions to use srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4 with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4 with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
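The Pi provider entry above can also be generated programmatically, which helps when switching between several local models. The following is a minimal sketch: the function name `pi_models_config` is hypothetical, and the field names are copied from the JSON shown above rather than from Pi's documentation. Writing the result to ~/.pi/agent/models.json is left to the reader, since it would overwrite any existing configuration.

```python
import json


def pi_models_config(model_id, base_url="http://localhost:8080/v1"):
    """Build the models.json structure shown above for one MLX-served model.

    Hypothetical helper: the keys mirror the JSON in this document.
    """
    return {
        "providers": {
            "mlx-lm": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


if __name__ == "__main__":
    config = pi_models_config(
        "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
    )
    # Print the JSON so it can be reviewed before being saved by hand.
    print(json.dumps(config, indent=2))
```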
- Hermes Agent
How to use srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4 with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4
Run Hermes
hermes
- MLX LM
How to use srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4 with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 by default)
mlx_lm.server --model "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
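The same chat request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is reachable on port 8080 (the port used by the Pi and Hermes configurations in this document) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4"
# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model=MODEL_ID, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send one user message and return the assistant's reply text.

    Assumes an OpenAI-style response body with a `choices` list.
    """
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` would print the model's reply.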
Gemma-4-12B-Coder (fable5 x composer2.5) - native MLX (nvfp4) for Apple Silicon
A GGUF-free, native-MLX build of yuxinlu1/gemma-4-12B-coder-fable5-composer2.5,
runnable on Apple Silicon via KrillLM's native Swift + MLX engine (no Python at inference).
Credit
The model is @yuxinlu1's fine-tune of google/gemma-4-12B-it - a Python/algorithmic coding model that reasons in Gemma's thinking channel before emitting a solution. All capability credit is theirs; please star the original repo. This repo only re-packages the weights to run natively on MLX.
What this is
The fine-tune is published as GGUF (lossy k-quants) and as NVFP4 safetensors (compressed-tensors). This build is converted GGUF-free from the NVFP4 side into MLX:
- decompress NVFP4 (compressed-tensors) -> bf16 (pure numpy; byte-exact to the source's stored values, self-checked), then
- requantize to MLX nvfp4 with attention o_proj and the vision/audio projectors protected at 8-bit (the proven mixed-precision recipe).
So it is lossless by construction relative to the NVFP4 source. Converter + recipe: KrillLM tools/ and docs/GEMMA4_12B_CODER_FINETUNE.md.
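To illustrate what the decompression step involves, here is a minimal sketch of block-scaled FP4 decoding. It is not the KrillLM converter: it is written in pure Python for readability rather than numpy, it omits any per-tensor scale, and it assumes 16-value blocks with the low nibble of each byte holding the earlier element. The E2M1 value table is the standard 4-bit float encoding; all function names are hypothetical.

```python
# Magnitudes for the three low bits of an FP4 (E2M1) code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one scale per 16 consecutive values


def decode_fp4(code):
    """Map a 4-bit code (0..15) to its E2M1 value."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes.

    Assumption: the low nibble holds the earlier element.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_blocks(packed, block_scales):
    """Decode packed FP4 codes and apply one scale per block."""
    codes = unpack_nibbles(packed)
    if len(codes) != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected one scale per 16 packed values")
    return [
        decode_fp4(code) * block_scales[index // BLOCK_SIZE]
        for index, code in enumerate(codes)
    ]
```

Because every code maps to one exact value and the scale is a single multiply, a decode of this kind can be checked value by value against the source, which is the sort of self-check the first step above describes.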
Use
brew tap srvsngh99/krillm && brew install krillm
KRILL_ENABLE_THINKING=1 krillm run gemma-4-12b-coder
KRILL_ENABLE_THINKING=1 opens the model's reasoning channel (this is a thinking fine-tune); without it, the model answers without reasoning. The build is ~6.7 GB and runs on a 24 GB Mac at ~25 tok/s decode with a ~1.6 s cold load.
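The ~6.7 GB figure is in line with a back-of-envelope estimate. The sketch below assumes roughly 12 billion parameters (taken from the model name), 4-bit values, and one 8-bit scale per 16-weight group. It ignores the layers kept at 8-bit and any unquantized tensors, so it is only a rough check, not a description of the actual file layout.

```python
def estimate_size_gb(n_params, value_bits=4, group_size=16, scale_bits=8):
    """Rough quantized size in GB (1e9 bytes).

    Each weight costs its own bits plus its share of the group's scale.
    All defaults are assumptions about the nvfp4 layout.
    """
    bits_per_weight = value_bits + scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


# 12e9 weights at 4.5 bits each
print(estimate_size_gb(12e9))
```

Under these assumptions each weight costs 4.5 bits, which gives about 6.75 GB for 12 billion weights.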
Benchmarks
Apple M4 Pro, 24 GB, macOS. KrillLM v0.8.0, this nvfp4 build. Reasoning on, greedy (temperature 0), single sample.
Standard EvalPlus harness (leaderboard-comparable):
| metric | pass@1 |
|---|---|
| HumanEval (base) | 85.4% |
| HumanEval+ (base + extra tests) | 83.5% |
For reference, KrillLM's own (more lenient) extraction harness scored HumanEval 89.6% reasoning-on / 82.9% reasoning-off on the same problems; the EvalPlus numbers above are the comparable ones. These reflect this fine-tune's capability measured through KrillLM's MLX runtime, not a KrillLM-vs-other-engine claim.
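With a single greedy sample per problem, pass@1 reduces to the fraction of problems whose one sample passes every test. The sketch below assumes the standard 164-problem HumanEval set; the solved counts are inferred from the percentages reported above and do not come from the source.

```python
HUMANEVAL_PROBLEMS = 164  # assumption: the standard HumanEval problem count


def pass_at_1(passed, total=HUMANEVAL_PROBLEMS):
    """pass@1 as a percentage for one sample per problem."""
    return round(100.0 * passed / total, 1)


def implied_passed(percent, total=HUMANEVAL_PROBLEMS):
    """Nearest whole number of solved problems for a reported percentage."""
    return round(percent / 100.0 * total)


for label, percent in [
    ("HumanEval (EvalPlus)", 85.4),
    ("HumanEval+ (EvalPlus)", 83.5),
    ("KrillLM harness, reasoning on", 89.6),
    ("KrillLM harness, reasoning off", 82.9),
]:
    solved = implied_passed(percent)
    print(f"{label}: ~{solved}/{HUMANEVAL_PROBLEMS} -> {pass_at_1(solved)}%")
```

Each reported percentage maps back to a whole number of solved problems out of 164, so the figures are consistent with that problem count.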
Notes
De-refused (not safety-aligned - add your own guardrails) and English/Python-centric, per the upstream model. This is a weight-only conversion; behavior is the upstream fine-tune's.
Model tree for srv-sngh/gemma-4-12B-coder-fable5-composer2.5-nvfp4
Base model
google/gemma-4-12B