Carwin-28B-MLX-MTP
A dense 27B local model for Apple Silicon: a DARE-TIES merge of reasoning-heavy Darwin and agent/tool-calling-heavy Carnice, on the Qwen3.6-27B base, packaged as a 4-bit MLX model with the Qwen3.6 MTP (multi-token prediction) head preserved for self-speculative decoding.
Why this exists
Built by a tech/AI hobbyist who enjoys tinkering with the Hermes agent framework. The goal was personal and specific: combine the tool-calling strength of Carnice with the reasoning of Darwin into one local, private model. This is the MLX build, made to run natively on Apple Silicon via oMLX / mlx-lm. (A GGUF build of the same model exists for llama.cpp.)
What it is
| Property | Value |
| --- | --- |
| Base | Qwen/Qwen3.6-27B |
| Reasoning parent | FINAL-Bench/Darwin-28B-Opus |
| Agent / tool-calling parent | kai-os/Carnice-V2-27b |
| Merge method | DARE-TIES (50/50, density 0.53 each, BF16) |
| Format | MLX (Apple Silicon) |
| Quantization | 4-bit body + BF16 MTP head |
| Size on disk | ~15 GB (4-bit MLX) |
| MTP | 15 MTP head tensors grafted from the Qwen3.6-27B base, preserved as a BF16 shard (not crushed to 4-bit) |
| License | Apache-2.0 (all three parent lines permissive) |
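As a rough sanity check on the ~15 GB figure, the arithmetic can be sketched as follows. This assumes a nominal 27e9 parameters, and assumes group-wise 4-bit quantization with a group size of 64 and a 16-bit scale plus a 16-bit bias per group. Those storage details are assumptions made for illustration, not values read from this package, and the BF16 MTP shard is not counted.

```python
def quantized_size_gb(n_params, bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Approximate size in GB (10^9 bytes) of group-wise quantized weights.

    Every group of `group_size` weights is assumed to carry one scale and
    one bias, which adds (scale_bits + bias_bits) / group_size bits per weight.
    """
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter count (assumption): 27e9
print(round(quantized_size_gb(27e9), 2))  # 4.5 bits per weight -> 15.19
```

Under these assumptions the body alone lands close to the size in the table; the real figure depends on the exact parameter count and on which tensors are left unquantized.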
How it was built
Built entirely on a single 32 GB Mac Studio (M2 Max), agent-driven through the Hermes framework using a mix of MiMo v2.5 and GPT-5.5, with no cloud GPUs or rented compute.
The model was produced from a full-precision BF16 master: a DARE-TIES merge of Darwin and Carnice against the Qwen3.6-27B base, with the 15-tensor Qwen3.6 MTP head grafted in. MLX and GGUF are separate branches off that one master — one format is not converted from the other.
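For readers unfamiliar with DARE-TIES, the idea behind the merge can be sketched on toy lists of floats. This is an illustration of the technique only, not mergekit's implementation: the function names are hypothetical, the weights are not normalized, and the real merge operates on full BF16 tensors.

```python
import random


def dare_prune(delta, density, rng):
    """DARE: keep each delta entry with probability `density`, rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def elect_signs(deltas, weights):
    """TIES: per parameter, elect the sign that carries the larger weighted mass."""
    signs = []
    for column in zip(*deltas):
        mass = sum(w * d for w, d in zip(weights, column))
        signs.append(1.0 if mass >= 0 else -1.0)
    return signs


def dare_ties_merge(base, finetuned, weights, density, seed=0):
    """Merge fine-tuned parameter lists into `base` (toy version, no weight normalization)."""
    rng = random.Random(seed)
    # Task vectors: what each fine-tune changed relative to the base.
    deltas = [
        dare_prune([f - b for f, b in zip(model, base)], density, rng)
        for model in finetuned
    ]
    signs = elect_signs(deltas, weights)
    merged = []
    for i, b in enumerate(base):
        # Only deltas that agree with the elected sign contribute.
        agreeing = sum(w * d[i] for w, d in zip(weights, deltas) if d[i] * signs[i] > 0)
        merged.append(b + agreeing)
    return merged


# Toy stand-ins for the base and the two parents (made-up numbers).
base = [0.0, 1.0, -1.0, 0.5]
reasoning_parent = [0.2, 1.1, -1.4, 0.5]
agent_parent = [0.1, 0.7, -0.9, 0.9]
print(dare_ties_merge(base, [reasoning_parent, agent_parent], [0.5, 0.5], density=0.53))
```

The 50/50 weights and the 0.53 density mirror the settings listed in the table; everything else in the sketch is illustrative.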
MLX path:
- Tensor verification: confirmed Darwin and the Qwen3.6-27B base shared the needed architecture/tensor structure, and confirmed which MTP tensors had to be grafted. Safetensors indexes were treated as insufficient proof; shape/dtype/name checks were done directly.
- DARE-TIES merge: merged Darwin and Carnice against the Qwen3.6-27B base (mergekit). The pre-graft output contained the merged body tensors only.
- MTP graft: copied the 15 `mtp.*` head tensors from the base into the merged output, written into a separate safetensors shard with an updated index. Verified the expected total tensor count.
- 4-bit MLX quantization: quantized the body to MLX 4-bit while keeping the MTP tensors as a separate BF16 shard (the quant config's ignore list excludes the MTP modules so they're not 4-bit). This keeps the draft head near-lossless.
- Byte verification: every MTP tensor was byte-checked by opening the actual safetensors shard, not just reading the index. Silent MTP drop is the known failure mode for this kind of work, so presence is confirmed by reading the actual bytes.
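A check along the lines of the verification steps above can be sketched with the standard library alone. It reads the shard's own header (an 8-byte little-endian length followed by a JSON header, per the safetensors format) rather than trusting the index file. The `mtp.` prefix and the count of 15 come from this card; the exact tensor names inside the converted MLX shard may carry a different prefix, so treat `prefix` as an assumption to adjust. The function names are hypothetical.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return (header_len, tensors) parsed from the raw bytes of a safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        header = json.loads(f.read(header_len))         # JSON header follows
    header.pop("__metadata__", None)                    # optional entry, not a tensor
    return header_len, header


def check_mtp_shard(path, prefix="mtp.", expected_count=15):
    """List tensors under `prefix` and collect any problems found in the shard itself."""
    header_len, tensors = read_safetensors_header(path)
    data_start = 8 + header_len
    file_size = os.path.getsize(path)
    names = sorted(name for name in tensors if name.startswith(prefix))
    problems = []
    if len(names) != expected_count:
        problems.append(f"expected {expected_count} tensors, found {len(names)}")
    for name in names:
        info = tensors[name]
        begin, end = info["data_offsets"]
        n_elements = 1
        for dim in info["shape"]:
            n_elements *= dim
        if info["dtype"] != "BF16":
            problems.append(f"{name}: dtype {info['dtype']} is not BF16")
        elif end - begin != 2 * n_elements:  # BF16 stores 2 bytes per element
            problems.append(f"{name}: byte length does not match shape")
        if data_start + end > file_size:
            problems.append(f"{name}: data runs past end of file")
    return names, problems
```

Calling `check_mtp_shard` on the MTP shard in the model directory should return 15 names and an empty problem list. Comparing content against the source tensors would additionally require hashing the byte ranges given by `data_offsets` in both files.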
Files
A standard MLX model directory: 4-bit body shards, a separate BF16 MTP shard, config, tokenizer, and chat template.
Running (Apple Silicon)
Requires an MLX runtime with Qwen3.6 support (oMLX, or mlx-lm). Load the model directory as you would any MLX model.
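With mlx-lm, loading and generating might look like the sketch below. It uses the `load` and `generate` functions from `mlx_lm` and assumes an installed mlx-lm version that supports this architecture. The helper names `build_messages` and `run_once` and the `max_tokens` value are illustrative choices, not part of the package.

```python
def build_messages(user_prompt):
    """Wrap a single user prompt in the chat-message format expected by the template."""
    return [{"role": "user", "content": user_prompt}]


def run_once(model_path, user_prompt, max_tokens=256):
    """Load the model, apply the chat template, and generate one reply."""
    # Imported lazily so this file can still be imported on machines without MLX.
    from mlx_lm import load, generate

    model, tokenizer = load(model_path)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens, verbose=True)
```

For example, `run_once("isneezekittens/Carwin-28B-MTP-MLX", "Write a story about Einstein")` downloads the repository on first use; a local model directory path works the same way.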
Notes:
- This is a dense 27B model — thorough and local, not small-and-fast.
- The MTP head is preserved in the package for self-speculative decoding; whether MTP is engaged is a runtime setting in your serving stack.
- Because the body is a merge that drifted from stock Qwen3.6 while the MTP head comes from the base, draft acceptance may differ from stock Qwen3.6. Measure on your own setup.
- Thinking control is best handled per-request (e.g. an `enable_thinking` chat-template flag) rather than a static default, depending on your runtime.
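Per-request thinking control could be sketched as below for an OpenAI-compatible server. Two things here are assumptions rather than confirmed behaviour: that your serving stack forwards a `chat_template_kwargs` field from the request body to the chat template, and that the server listens at `http://localhost:8080/v1`. Check both against your runtime's documentation.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking,
                       model="isneezekittens/Carwin-28B-MTP-MLX",
                       base_url="http://localhost:8080/v1"):
    """Build (but do not send) a chat-completions request with a per-request thinking flag."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes this through to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

A quick factual lookup might use `build_chat_request("Hello", enable_thinking=False)`, while a multi-step reasoning task sets the flag to `True`; pass the result to `send` once the server is running.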
Validation
Confirmed during build:
- MTP head verified present as a BF16 shard, byte-checked against the source tensors.
- Reasoning: the bat-and-ball problem is answered correctly (the ball costs $0.05), and a classic "drive or walk to the car wash" trick question is handled correctly.
- Tool-calling: clean single-tool and multi-tool OpenAI-style function calls render correctly.
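To repeat the tool-calling check on your own setup, a request and response handler in the OpenAI function-calling format might look like this sketch. The `get_weather` tool is a made-up example, and the helper names are hypothetical; only the `tools` and `tool_calls` field layout follows the OpenAI chat-completions convention.

```python
import json

# Hypothetical tool definition, used only for illustration.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_payload(prompt, tools, model="isneezekittens/Carwin-28B-MTP-MLX"):
    """Build a chat-completions payload that offers `tools` to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string and must parse cleanly.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

A clean result is one where every entry in `tool_calls` names a tool that was offered and its `arguments` string parses as JSON matching the declared schema.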
Performance (tokens/sec, draft-acceptance rate) has not been benchmarked under controlled conditions and is intentionally not stated here. Measure it on your own hardware.
Known quirks
- Identity: the model may identify as Gemini. This is cosmetic lineage residue from the merge, not a fault.
- Dense 27B: thorough and local, not small-and-fast.
- MTP preserved, acceptance unmeasured on this merge: the head is physically in the package; how well speculative decoding accepts on the merged body is for you to measure.
- Large agent prompts: very large prompts (tens of thousands of tokens) can be slow to process on this class of hardware; trimmed prompts run cleaner.
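If your runtime reports how many draft tokens were accepted at each verification step, the acceptance figure mentioned above can be summarized with a small helper like this one. The function name and the sample counts are made up for illustration, and the sketch assumes the usual speculative-decoding accounting in which every step emits the accepted draft tokens plus one token from the main model.

```python
def acceptance_stats(accepted_per_step, draft_len):
    """Summarize draft acceptance from per-step accepted-token counts."""
    steps = len(accepted_per_step)
    if steps == 0:
        raise ValueError("need at least one verification step")
    if any(a < 0 or a > draft_len for a in accepted_per_step):
        raise ValueError("accepted count must be between 0 and draft_len")
    accepted = sum(accepted_per_step)
    return {
        # Share of proposed draft tokens that the main model accepted.
        "acceptance_rate": accepted / (steps * draft_len),
        # Accepted drafts plus the one token the main model emits per step.
        "tokens_per_step": (accepted + steps) / steps,
    }


# Made-up log: one draft token per step, four steps.
print(acceptance_stats([1, 0, 1, 1], draft_len=1))
```

Comparing `tokens_per_step` with MTP on against plain decoding (always 1.0) gives a hardware-independent view; wall-clock tokens/sec still has to be timed on your own machine.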
Credits
All credit to the authors of the parent models and base: FINAL-Bench/Darwin-28B-Opus, kai-os/Carnice-V2-27b, and Qwen/Qwen3.6-27B. Merged with mergekit.