Carwin-28B-MLX-MTP

A dense 27B local model for Apple Silicon: a DARE-TIES merge of reasoning-heavy Darwin and agent/tool-calling-heavy Carnice, on the Qwen3.6-27B base, packaged as a 4-bit MLX model with the Qwen3.6 MTP (multi-token prediction) head preserved for self-speculative decoding.

Why this exists

Built by a tech/AI hobbyist who enjoys tinkering with the Hermes agent framework. The goal was personal and specific: combine the tool-calling strength of Carnice with the reasoning of Darwin into one local, private model. This is the MLX build, made to run natively on Apple Silicon via oMLX / mlx-lm. (A GGUF build of the same model exists for llama.cpp.)

What it is

Base Qwen/Qwen3.6-27B
Reasoning parent FINAL-Bench/Darwin-28B-Opus
Agent / tool-calling parent kai-os/Carnice-V2-27b
Merge method DARE-TIES (50/50, density 0.53 each, BF16)
Format MLX (Apple Silicon)
Quantization 4-bit body + BF16 MTP head
Size on disk ~15 GB (4-bit MLX)
MTP 15 MTP head tensors grafted from the Qwen3.6-27B base, preserved as a BF16 shard (not crushed to 4-bit)
License Apache-2.0 (all three parent lines permissive)

How it was built

Built entirely on a single 32GB Mac Studio (M2 Max), agent-driven through the Hermes framework using a mix of MiMo v2.5 and GPT-5.5 — no cloud GPUs or rented compute.

The model was produced from a full-precision BF16 master: a DARE-TIES merge of Darwin and Carnice against the Qwen3.6-27B base, with the 15-tensor Qwen3.6 MTP head grafted in. MLX and GGUF are separate branches off that one master — one format is not converted from the other.

MLX path:

  1. Tensor verification — confirmed Darwin and the Qwen3.6-27B base shared the needed architecture/tensor structure, and confirmed which MTP tensors had to be grafted. Safetensors indexes were treated as insufficient proof; shape/dtype/name checks were done directly.
  2. DARE-TIES merge — merged Darwin and Carnice against the Qwen3.6-27B base (mergekit). The pre-graft output contained the merged body tensors only.
  3. MTP graft — copied the 15 mtp.* head tensors from the base into the merged output, written into a separate safetensors shard with an updated index. Verified the expected total tensor count.
  4. 4-bit MLX quantization — quantized the body to MLX 4-bit while keeping the MTP tensors as a separate BF16 shard (the quant config's ignore list excludes the MTP modules so they're not 4-bit). This keeps the draft head near-lossless.
  5. Byte verification — every MTP tensor was byte-checked by opening the actual safetensors shard, not just reading the index. Silent MTP drop is the known failure mode for this kind of work, so presence is confirmed by reading the actual bytes.

Files

A standard MLX model directory: 4-bit body shards, a separate BF16 MTP shard, config, tokenizer, and chat template.

Running (Apple Silicon)

Requires an MLX runtime with Qwen3.6 support (oMLX, or mlx-lm). Load the model directory as you would any MLX model.

Notes:

  • This is a dense 27B model — thorough and local, not small-and-fast.
  • The MTP head is preserved in the package for self-speculative decoding; whether MTP is engaged is a runtime setting in your serving stack.
  • Because the body is a merge that drifted from stock Qwen3.6 while the MTP head comes from the base, draft acceptance may differ from stock Qwen3.6. Measure on your own setup.
  • Thinking control is best handled per-request (e.g. an enable_thinking chat-template flag) rather than a static default, depending on your runtime.

Validation

Confirmed during build:

  • MTP head verified present as a BF16 shard, byte-checked against the source tensors.
  • Reasoning: the bat-and-ball problem is answered correctly (the ball costs $0.05), and a classic "drive or walk to the car wash" trick question is handled correctly.
  • Tool-calling: clean single-tool and multi-tool OpenAI-style function calls render correctly.

Performance (tokens/sec, draft-acceptance rate) has not been benchmarked under controlled conditions and is intentionally not stated here. Measure it on your own hardware.

Known quirks

  • Identity: the model may identify as Gemini. This is cosmetic lineage residue from the merge, not a fault.
  • Dense 27B: thorough and local, not small-and-fast.
  • MTP preserved, acceptance unmeasured on this merge: the head is physically in the package; how well speculative decoding accepts on the merged body is for you to measure.
  • Large agent prompts: very large prompts (tens of thousands of tokens) can be slow to process on this class of hardware; trimmed prompts run cleaner.

Credits

All credit to the authors of the parent models and base: FINAL-Bench/Darwin-28B-Opus, kai-os/Carnice-V2-27b, and Qwen/Qwen3.6-27B. Merged with mergekit.

Downloads last month
-
Safetensors
Model size
27B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for isneezekittens/Carwin-28B-MTP-MLX

Quantized
(4)
this model