Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM

A sensitivity-graded 3.6-bit MLX quantization of moonshotai/Kimi-K2.7-Code — a ~1T-parameter (32B active) DeepSeek-V3-style MoE model with a MoonViT vision tower — that keeps the vision encoder, so it does image-text-to-text (and video) on Apple-Silicon M3 Ultra.

465.9 GB on disk. Loads on a single clean 512 GB M3 Ultra (peak 467 GB), or split across two over Thunderbolt.

This is the +vision build: the same model as the recommended text/code build, Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw, with byte-identical LLM weights (xet-deduped on the Hub), plus the full MoonViT vision tower and multimodal projector (335 tensors, bf16, +0.9 GB). Get this build only if you need image or video input; for pure text/code the text build is leaner and runs on the more battle-tested mlx-lm stack (validated two-machine pipeline + mlx_lm.server). Unlike every other community MLX build of Kimi-K2.x, each of which drops the vision tower, this one keeps MoonViT so the model can actually see.

What works

| Modality | Status |
| --- | --- |
| Text / code | ✅ full (same LLM as the text build) |
| Image | ✅ validated: correct, detailed descriptions; ~23 tok/s decode, peak 467 GB |
| Video | ✅ the MoonViT 3D path (temporal pos-emb + spatial-temporal attention + temporal-pool merger + block-diagonal varlen attention) is ported into mlx-vlm, runs at ~23 tok/s decode, and with multi-chunk input the model reasons about motion/changes across frames |

Image example (test photo: a person from behind in a knit beanie + tan corduroy jacket, foggy forest):

"a single person photographed from behind … a thick, chunky knit beanie in muted gray … a tan/caramel-brown corduroy jacket with a prominent hood … the background is soft, blurred and misty/foggy …"

Recipe (verified from config.json)

Same sensitivity-graded LLM recipe as the text build, plus the vision tower kept in bf16:

| Component | Bits |
| --- | --- |
| Routed experts gate/up | 3-bit g64 |
| Routed experts down_proj | 4-bit on 16/60 layers, 3-bit elsewhere |
| Attention (MLA) · shared · dense · embed · head | 6-bit g64 |
| MoE router | bf16 |
| MoonViT vision tower + mm_projector | bf16 (335 tensors, ~0.9 GB) |

Effective 3.629 bits/weight. Re-quantized from the INT4 master via the #907 dequant-first fix (asking for 3-bit on a compressed-tensors source otherwise silently keeps the experts at 4-bit → ~5 bpw / 640 GB).
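As a rough sanity check on the recipe, the sketch below estimates the stored bits per weight of the routed experts. It rests on assumptions that are not stated in this card: that each quantization group carries one 16-bit scale and one 16-bit bias (the usual MLX affine layout), that down_proj also uses group size 64, and that gate, up, and down projections hold the same number of weights. The function name `effective_bits` is purely illustrative.

```python
def effective_bits(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Stored bits per weight for group quantization.

    Assumption: each group stores a 16-bit scale plus a 16-bit bias,
    i.e. 32 overhead bits shared by `group_size` weights.
    """
    return bits + overhead_bits / group_size


# Routed-expert projections from the recipe table (g64 assumed throughout).
gate_up = effective_bits(3)   # 3-bit g64 -> 3.5
down_hi = effective_bits(4)   # 4-bit g64 -> 4.5 (16 of 60 MoE layers)
down_lo = effective_bits(3)   # 3-bit g64 -> 3.5 (remaining 44 layers)

# Layer-weighted average for down_proj.
down_avg = (16 * down_hi + 44 * down_lo) / 60

# Assumption: gate, up and down projections have equal weight counts.
expert_avg = (2 * gate_up + down_avg) / 3

print(f"down_proj average: {down_avg:.3f} bits/weight")
print(f"routed-expert average: {expert_avg:.3f} bits/weight")
```

Under these assumptions the routed experts come out slightly below 3.6 bits/weight, which would be consistent with the reported 3.629 once the 6-bit and bf16 tensors are counted in.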

LLM quality. Because the language-model weights are byte-identical to the text build, its measured quality carries over directly: mean KL(4-bit ref ‖ this) = 0.199 ± 0.009 nats (median 0.006), top-1 flip-rate 10.2% (416/4096) against a 4-bit reference of the same model. Distributionally near-identical to a 4-bit build on typical tokens, with a ~10% greedy top-1 divergence concentrated on harder positions (the expected cost of 3-bit experts) — see the text card for the full breakdown and method. The MoonViT vision tower is kept in bf16, so it adds no quantization error.
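For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how a per-position KL divergence and a top-1 flip rate could be computed from two sets of logits. It is not the evaluation script behind the numbers above (the text card describes the actual method); the function names and the toy logits are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, test_positions):
    """Return (mean KL, top-1 flip rate) over aligned token positions."""
    kls = [kl_divergence(r, t) for r, t in zip(ref_positions, test_positions)]
    flips = sum(argmax(r) != argmax(t) for r, t in zip(ref_positions, test_positions))
    return sum(kls) / len(kls), flips / len(kls)


# Toy data: position 0 agrees exactly, position 1 swaps its top token.
ref = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
test = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]
mean_kl, flip_rate = compare(ref, test)
print(f"mean KL = {mean_kl:.3f} nats, flip rate = {flip_rate:.1%}")
```

A large gap between mean and median KL, as reported above, is what this kind of measurement shows when most positions agree closely and a minority diverge strongly.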

Usage

Requires an mlx-vlm release that includes the kimi_k25 model (0.6.3 or later), plus tiktoken and blobfile for the tokenizer.

Image (fast native path):

python -m mlx_vlm generate \
  --model avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM \
  --prompt "Describe this image." --image photo.jpg \
  --max-tokens 512 --temperature 0.0

Video: mlx-vlm 0.6.3's stock kimi_k25 vision tower is 2D-only, and its video CLI is Qwen-specific. This build ships a patched vision.py (3D MoonViT: temporal pos-emb, per-frame RoPE, t·h·w cu_seqlens, sd2_tpool temporal-pool merger, and block-diagonal varlen attention so multi-chunk input doesn't OOM) plus a helper, video_infer.py, that decodes frames with cv2, groups them into temporal chunks, and runs the native fast generate path:

python video_infer.py <model> --video clip.mp4 --num-frames 16 --chunk-frames 4 \
       --prompt "Describe what happens over time in this video."

How it works: each chunk of ≤4 frames is temporally mean-pooled to one spatial token set (sd2_tpool); using multiple chunks gives the LLM a temporal sequence, so it reasons about movement/changes across frames. Decode runs at the same ~23 tok/s as images (the helper wraps generation in wired_limit to keep the 465 GB weights resident — without it decode pages from mmap and collapses to ~0.2 tok/s).
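The chunking idea described above can be sketched in plain Python. This is not the shipped video_infer.py or the patched vision.py: the function names are hypothetical, the even-spacing sampler is one plausible choice, and the pooling shows only the temporal mean over a chunk (any spatial merging done by sd2_tpool is left out).

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to `num_frames` indices spread evenly over the clip."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    # Take the midpoint of each equal-width segment.
    return [int(i * step + step / 2) for i in range(num_frames)]


def group_into_chunks(indices: list[int], chunk_frames: int) -> list[list[int]]:
    """Split sampled frames into consecutive chunks of at most `chunk_frames`."""
    return [indices[i:i + chunk_frames] for i in range(0, len(indices), chunk_frames)]


def temporal_mean_pool(chunk_features):
    """Average features over the frames of one chunk.

    `chunk_features[f][t][d]` is feature d of spatial token t in frame f;
    the result has one token set for the whole chunk.
    """
    n_frames = len(chunk_features)
    n_tokens = len(chunk_features[0])
    n_dims = len(chunk_features[0][0])
    return [
        [sum(frame[t][d] for frame in chunk_features) / n_frames for d in range(n_dims)]
        for t in range(n_tokens)
    ]


# Mirrors --num-frames 16 --chunk-frames 4 on a hypothetical 160-frame clip.
indices = sample_frame_indices(total_frames=160, num_frames=16)
chunks = group_into_chunks(indices, chunk_frames=4)
print(len(chunks), "chunks:", chunks)

# Two frames, two spatial tokens, two feature dims.
pooled = temporal_mean_pool([[[1.0, 2.0], [3.0, 4.0]], [[3.0, 4.0], [5.0, 6.0]]])
print(pooled)
```

With 16 frames in chunks of 4, the language model would see four pooled token sets in temporal order, which is what lets it compare earlier and later parts of the clip.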

Memory

MLA keeps the KV cache tiny (≈68.6 KB/token); native context is 256K. Peak memory during image inference is 467 GB on a single 512 GB box; the LLM also supports a pipeline/tensor-parallel split across two machines (the vision tower is tiny and runs on one).
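The per-token KV figure can be reproduced with a short calculation. The dimensions below are assumptions based on the DeepSeek-V3-style MLA layout (a 512-wide compressed KV latent plus a 64-wide RoPE key, 61 layers, 16-bit cache entries) and are not read from this model's config.json; check the config before relying on them. "256K" is taken here to mean 262,144 tokens.

```python
# Assumed DeepSeek-V3-style MLA dimensions (verify against config.json).
KV_LORA_RANK = 512       # compressed KV latent width per layer
QK_ROPE_HEAD_DIM = 64    # shared RoPE key width per layer
NUM_LAYERS = 61
BYTES_PER_VALUE = 2      # 16-bit cache entries


def kv_bytes_per_token(
    kv_lora_rank: int = KV_LORA_RANK,
    rope_dim: int = QK_ROPE_HEAD_DIM,
    num_layers: int = NUM_LAYERS,
    bytes_per_value: int = BYTES_PER_VALUE,
) -> int:
    """Bytes of MLA cache stored for one token across all layers."""
    return (kv_lora_rank + rope_dim) * num_layers * bytes_per_value


per_token = kv_bytes_per_token()
per_token_kib = per_token / 1024

context_tokens = 256 * 1024  # assumption: 256K = 262,144 tokens
full_context_gib = per_token * context_tokens / 2**30

print(f"{per_token} bytes/token = {per_token_kib:.3f} KiB/token")
print(f"full 256K context: {full_context_gib:.2f} GiB")
```

Under these assumptions the result is 70,272 bytes (about 68.6 KiB) per token, matching the figure quoted above, and roughly 17 GiB of cache at the full native context.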

Credits & citation

Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM — sensitivity-graded 3.6-bit image+video MLX quantization of Kimi-K2.7-Code (vision tower kept), 2026.
