Instructions to use spicyneuron/Kimi-K2.7-Code-MLX-3.6bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Kimi-K2.7-Code-MLX-3.6bit with MLX:
# Make sure mlx-lm is installed # pip install --upgrade mlx-lm # if on a CUDA device, also pip install mlx[cuda] # Generate text with mlx-lm from mlx_lm import load, generate model, tokenizer = load("spicyneuron/Kimi-K2.7-Code-MLX-3.6bit") prompt = "Once upon a time in" text = generate(model, tokenizer, prompt=prompt, verbose=True) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- MLX LM
How to use spicyneuron/Kimi-K2.7-Code-MLX-3.6bit with MLX LM:
Generate or start a chat session
# Install MLX LM uv tool install mlx-lm # Generate some text mlx_lm.generate --model "spicyneuron/Kimi-K2.7-Code-MLX-3.6bit" --prompt "Once upon a time"
Uploading... ๐
moonshotai/Kimi-K2.7-Code optimized for running on a Mac Studio M3 Ultra.
- A mixed-precision quant that balances speed, memory, and accuracy.
- 3-bit MoE baseline with important always-on layers at higher precision.
- Fits into ~460 GB memory, leaving enough room for a smaller utility model.
Usage
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/Kimi-K2.7-Code-MLX-3.6bit
Benchmarks
TBD
Methodology
Quantized with a mlx-lm fork. MLX quantization options differ than llama.cpp, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
Model tree for spicyneuron/Kimi-K2.7-Code-MLX-3.6bit
Base model
moonshotai/Kimi-K2.7-Code