Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM
A sensitivity-graded 3.6-bit MLX quantization of moonshotai/Kimi-K2.7-Code — a ~1T-parameter (32B active) DeepSeek-V3-style MoE model with a MoonViT vision tower — that keeps the vision encoder, so it does image-text-to-text (and video) on Apple-Silicon M3 Ultra.
465.9 GB on disk. Loads on a single clean 512 GB M3 Ultra (peak 467 GB), or split across two over Thunderbolt.
This is the +vision build: the same model as the recommended text/code build Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw — byte-identical LLM weights (xet-deduped on the Hub) — plus the full MoonViT vision tower + multimodal projector (335 tensors, bf16, +0.9 GB). Get this build only if you need image / video; for pure text/code the text build is leaner and runs on the more battle-tested mlx-lm stack (validated two-machine pipeline + mlx_lm.server). Unlike every other community MLX build of Kimi-K2.x — which drop the vision tower — this one keeps MoonViT so the model can actually see.
What works
| Modality | Status |
|---|---|
| Text / code | ✅ full (same LLM as the text build) |
| Image | ✅ validated — correct, detailed descriptions; ~23 tok/s decode, peak 467 GB |
| Video | ✅ the MoonViT 3D path (temporal pos-emb + spatial-temporal attention + temporal-pool merger + block-diagonal varlen attention) is ported into mlx-vlm, runs at ~23 tok/s decode, and with multi-chunk input the model reasons about motion/changes across frames. |
Image example (test photo: a person from behind in a knit beanie + tan corduroy jacket, foggy forest):
"a single person photographed from behind … a thick, chunky knit beanie in muted gray … a tan/caramel-brown corduroy jacket with a prominent hood … the background is soft, blurred and misty/foggy …"
Recipe (verified from config.json)
Same sensitivity-graded LLM recipe as the text build, plus the vision tower kept in bf16:
| Component | Bits |
|---|---|
| Routed experts gate/up | 3-bit g64 |
| Routed experts down_proj | 4-bit on 16/60 layers, 3-bit elsewhere |
| Attention (MLA) · shared · dense · embed · head | 6-bit g64 |
| MoE router | bf16 |
| MoonViT vision tower + mm_projector | bf16 (335 tensors, ~0.9 GB) |
Effective 3.629 bits/weight. Re-quantized from the INT4 master via the #907 dequant-first fix (asking for 3-bit on a compressed-tensors source otherwise silently keeps the experts at 4-bit → ~5 bpw / 640 GB).
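As a rough sanity check on the effective figure, the sketch below estimates the average bits per weight of the routed experts alone. It rests on assumptions the card does not state: that each group of 64 weights carries one 16-bit scale and one 16-bit bias, that the gate, up, and down projections have equal parameter counts, and that the 60 layers in the table are all of the MoE layers. The function names are illustrative only.

```python
def effective_bits(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Bits per weight including per-group overhead.

    Assumption: one 16-bit scale + one 16-bit bias per group (32 bits total).
    """
    return bits + overhead_bits / group_size


def routed_expert_bpw(layers: int = 60, layers_4bit_down: int = 16) -> float:
    """Average bits/weight over gate, up, and down projections of the routed experts.

    Assumption: the three projections have the same number of parameters.
    """
    gate_up = effective_bits(3)  # gate and up are 3-bit g64 in every layer
    down = (
        layers_4bit_down * effective_bits(4)
        + (layers - layers_4bit_down) * effective_bits(3)
    ) / layers
    return (2 * gate_up + down) / 3


print(f"routed experts: {routed_expert_bpw():.3f} bits/weight")
```

Under these assumptions the routed experts come out near 3.59 bits/weight. The 6-bit and bf16 components would plausibly account for the remaining gap to the reported 3.629.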
LLM quality. Because the language-model weights are byte-identical to the text build, its measured quality carries over directly: mean KL(4-bit ref ‖ this) = 0.199 ± 0.009 nats (median 0.006), top-1 flip-rate 10.2% (416/4096) against a 4-bit reference of the same model. Distributionally near-identical to a 4-bit build on typical tokens, with a ~10% greedy top-1 divergence concentrated on harder positions (the expected cost of 3-bit experts) — see the text card for the full breakdown and method. The MoonViT vision tower is kept in bf16, so it adds no quantization error.
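The two metrics quoted above can be computed from per-position logits of a reference model and a quantized model. The following is a minimal NumPy sketch, not the evaluation script used for this card; the function names and the `(positions, vocab)` array layout are assumptions.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_and_flip_rate(ref_logits: np.ndarray, test_logits: np.ndarray):
    """Return (mean KL, median KL, top-1 flip rate) for KL(ref || test).

    Both inputs have shape (positions, vocab); KL is in nats.
    """
    ref_logp = log_softmax(ref_logits.astype(np.float64))
    test_logp = log_softmax(test_logits.astype(np.float64))
    # KL(ref || test) = sum_v p_ref(v) * (log p_ref(v) - log p_test(v))
    kl = (np.exp(ref_logp) * (ref_logp - test_logp)).sum(axis=-1)
    # A "flip" is a position where the greedy (argmax) token differs.
    flips = ref_logits.argmax(axis=-1) != test_logits.argmax(axis=-1)
    return float(kl.mean()), float(np.median(kl)), float(flips.mean())


# Toy example with a 3-token vocabulary and 2 positions.
ref = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
test = np.array([[2.0, 0.0, 0.0], [1.5, 1.0, 0.0]])
mean_kl, median_kl, flip_rate = kl_and_flip_rate(ref, test)
print(mean_kl, median_kl, flip_rate)
```

In the toy example the first position is identical in both models and the second changes its argmax, so the flip rate is 0.5. A large gap between mean and median KL, as in the figures above, indicates that most positions match closely while a minority diverge strongly.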
Usage
Requires mlx-vlm 0.6.3 or later (for the kimi_k25 model), plus tiktoken and blobfile for the tokenizer.
Image (fast native path):
python -m mlx_vlm generate \
--model avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM \
--prompt "Describe this image." --image photo.jpg \
--max-tokens 512 --temperature 0.0
Video — mlx-vlm 0.6.3's stock kimi_k25 vision tower is 2D-only and its video CLI is Qwen-specific. This build ships a patched vision.py (3D MoonViT: temporal pos-emb, per-frame RoPE, t·h·w cu_seqlens, sd2_tpool temporal-pool merger, block-diagonal varlen attention so multi-chunk doesn't OOM) + a helper video_infer.py that decodes frames (cv2), groups them into temporal chunks, and runs the native fast generate path:
python video_infer.py <model> --video clip.mp4 --num-frames 16 --chunk-frames 4 \
--prompt "Describe what happens over time in this video."
How it works: each chunk of ≤4 frames is temporally mean-pooled to one spatial token set (sd2_tpool); using multiple chunks gives the LLM a temporal sequence, so it reasons about movement/changes across frames. Decode runs at the same ~23 tok/s as images (the helper wraps generation in wired_limit to keep the 465 GB weights resident — without it decode pages from mmap and collapses to ~0.2 tok/s).
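The chunking and temporal pooling described above can be illustrated with a small NumPy sketch. It is not the shipped `vision.py` or `video_infer.py`: it shows only the temporal mean over each chunk, and the function name and the `(frames, tokens, dim)` feature layout are assumptions. Whatever else the real sd2_tpool merger does is not modeled here.

```python
import numpy as np


def chunk_and_mean_pool(frame_feats: np.ndarray, chunk_frames: int = 4) -> np.ndarray:
    """Group per-frame features into temporal chunks and mean-pool each chunk.

    frame_feats: array of shape (frames, tokens, dim).
    Returns an array of shape (chunks, tokens, dim), one token set per chunk.
    The last chunk may hold fewer than `chunk_frames` frames.
    """
    chunks = [
        frame_feats[start:start + chunk_frames]
        for start in range(0, len(frame_feats), chunk_frames)
    ]
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


# 16 frames, 2 spatial tokens per frame, feature dimension 3 (toy sizes).
feats = np.arange(16 * 2 * 3, dtype=np.float64).reshape(16, 2, 3)
pooled = chunk_and_mean_pool(feats, chunk_frames=4)
print(pooled.shape)  # (4, 2, 3): four chunks, each reduced to one token set
```

With 16 frames and 4 frames per chunk, the language model receives four token sets in temporal order, which is what lets it compare earlier and later parts of the clip.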
Memory
MLA keeps the KV cache tiny (≈68.6 KB/token); native context is 256K. Peak memory during image inference is 467 GB on a single 512 GB box; the LLM also supports a pipeline/tensor-parallel split across two machines (the vision tower is tiny and runs on one).
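To put the per-token figure in context, the sketch below multiplies it out to a full 256K-token context. It assumes "KB" means KiB and "256K" means 256 × 1024 tokens. The second function shows one decomposition that reproduces roughly 68.6 KiB/token using DeepSeek-V3-style MLA dimensions (a 512-wide compressed KV latent plus a 64-wide RoPE key, 61 layers, 2 bytes per value); those dimensions are assumptions, not values read from this model's config.json.

```python
def kv_cache_gib(tokens: int, kib_per_token: float = 68.6) -> float:
    """Total KV-cache size in GiB for a given number of cached tokens."""
    return tokens * kib_per_token / 2**20  # KiB -> GiB


def mla_kib_per_token(latent_dim: int = 512, rope_dim: int = 64,
                      layers: int = 61, bytes_per_value: int = 2) -> float:
    """Per-token MLA cache size in KiB (all default dimensions are assumed)."""
    return (latent_dim + rope_dim) * layers * bytes_per_value / 1024


full_context = 256 * 1024
print(f"{kv_cache_gib(full_context):.2f} GiB at {full_context} tokens")
print(f"{mla_kib_per_token():.3f} KiB/token under the assumed dimensions")
```

Under these assumptions a full-length context needs on the order of 17 GiB of KV cache, which is small next to the 465.9 GB of weights.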
Credits & citation
- Base: moonshotai/Kimi-K2.7-Code (Modified MIT). Built with mlx-vlm / mlx-lm on Apple MLX.
Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw-VLM — sensitivity-graded 3.6-bit image+video MLX quantization of Kimi-K2.7-Code (vision tower kept), 2026.

