Instructions to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit
Run Hermes
hermes
- MLX LM
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
Qwen3.6-35B-A3B optimized for MLX.
- 4-bit baseline with important layers at 8-bit and BF16.
- This quant does not support image input.
I ended up selecting two winners from my trials. This is the quality+ version, and here's the speed+ version.
Usage
# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit
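Once the server is up, any OpenAI-style client should be able to talk to it. As a minimal sketch using only the Python standard library, assuming the host and port from the command above (the `build_request` and `chat` helper names and the `max_tokens` default are illustrative, not part of mlx-lm):

```python
import json
import urllib.request

# Assumed to match the --host/--port flags used when starting the server.
BASE_URL = "http://127.0.0.1:8080/v1"
MODEL = "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello")` should return the model's reply as a string. Nothing is sent until `chat` is called, so the request can be inspected first via `build_request`.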
Benchmarks
| metric | mlx-community/Qwen3.6-35B-A3B-4bit | mlx-community/Qwen3.6-35B-A3B-4.4bit-msq | 4.8 bit | 5.4 bit (this model) |
|---|---|---|---|---|
| bpw | 4.503 | 4.787 | 4.788 | 5.438 |
| peak memory, GB (1024/512) | 20.683 | 21.922 | 21.928 | 24.741 |
| prompt tok/s (1024) | 2719.4470 ± 15.2250 | 2695.9370 ± 12.5260 | 2734.5260 ± 3.8810 | 2665.3060 ± 11.4520 |
| gen tok/s (512) | 108.4990 ± 0.4910 | 94.2940 ± 0.3650 | 97.2820 ± 0.0800 | 89.4920 ± 0.2610 |
| kl divergence | 0.0838 ± 0.0008 | 0.1689 ± 0.0015 | 0.0244 ± 0.0004 | 0.0189 ± 0.0003 |
| perplexity | 4.6150 ± 0.0320 | 4.2490 ± 0.0280 | 4.6410 ± 0.0320 | 4.6440 ± 0.0320 |
| hellaswag | 0.5560 ± 0.0220 | 0.5780 ± 0.0220 | 0.5440 ± 0.0220 | 0.5370 ± 0.0110 |
| piqa | 0.7940 ± 0.0180 | 0.7920 ± 0.0180 | 0.7920 ± 0.0180 | 0.7980 ± 0.0180 |
| winogrande | 0.7260 ± 0.0200 | 0.7400 ± 0.0200 | 0.7120 ± 0.0200 | 0.7100 ± 0.0200 |
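As a rough cross-check of the bpw and peak memory rows, weight size can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes a nominal 35e9 parameters (read off the "35B" in the model name, not an exact count) and ignores activations and KV cache, so it should land somewhat under the measured peak:

```python
# Nominal parameter count taken from the "35B" in the model name (an assumption;
# the exact count differs, and activations / KV cache are not included).
NOMINAL_PARAMS = 35e9

# bpw values copied from the benchmark table.
BPW = {
    "4bit": 4.503,
    "4.4bit-msq": 4.787,
    "4.8 bit": 4.788,
    "5.4 bit": 5.438,
}


def weight_gb(params: float, bpw: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params * bpw / 8 / 1e9


for name, bpw in BPW.items():
    print(f"{name:>12}: ~{weight_gb(NOMINAL_PARAMS, bpw):.1f} GB of weights")
```

If the peak memory row is in GB, these estimates sit roughly 1 GB below each measured value, which is consistent with a small fixed overhead on top of the weights.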
I've moved to using generation speed and KL divergence as my primary optimization metrics. HellaSwag, PIQA, WinoGrande, and perplexity are kept as sanity checks, though they need large sample sizes to give a usable signal.
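To see why the accuracy benchmarks need large samples, a simple binomial model is enough. This is a sketch that assumes independent samples; the evaluation harness may compute its error bars slightly differently, and the helper names are illustrative:

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def samples_needed(accuracy: float, target_stderr: float) -> int:
    """Smallest n whose standard error is at or below target_stderr."""
    return math.ceil(accuracy * (1.0 - accuracy) / target_stderr**2)


# With 500 samples and an accuracy near 0.556 (the 4-bit hellaswag entry),
# the standard error comes out at about 0.022.
print(round(binomial_stderr(0.556, 500), 4))

# Sample count needed to shrink that standard error to 0.005.
print(samples_needed(0.556, 0.005))
```

At 500 samples the error bar is about as large as the gaps between quants in the table, so those rows cannot rank the models on their own. Tightening the error to 0.005 would take on the order of ten thousand samples per task under this model.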
Tested on a Mac Studio M3 Ultra with:
mlx_lm.convert --hf-path Qwen/Qwen3.6-35B-A3B --mlx-path ./mlx && mlx_lm.kld --baseline-model ./mlx
mlx_lm.perplexity --sequence-length 512 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 500
mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 500
mlx_lm.kld is still an open PR.
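Since that command is not merged yet, here is a minimal pure-Python sketch of the metric itself: the KL divergence between the baseline model's next-token distribution and the quantized model's, averaged over token positions. This is not the code from the PR; the direction (baseline as reference), the use of nats, and the plain mean are assumptions, and the function names are illustrative.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one token position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(baseline_logits: list[float], quant_logits: list[float]) -> float:
    """KL(P_baseline || P_quant) in nats for a single token position."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(baseline: list[list[float]], quant: list[list[float]]) -> float:
    """Average per-position KL over a sequence of logit vectors."""
    values = [kl_divergence(b, q) for b, q in zip(baseline, quant)]
    return sum(values) / len(values)


# Toy logits for two positions over a three-token vocabulary.
baseline = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]
print(mean_kl(baseline, quant))
```

A value of zero means the quantized model reproduces the baseline distribution exactly, so lower is better.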
Methodology
Quantized with an mlx-lm fork, drawing inspiration from Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX quantization options differ from llama.cpp's, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
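The two principles above amount to a per-layer rule that maps a layer's name to a bit width. The sketch below is illustrative only: the layer-path patterns, the bit choices, and the helper names are hypothetical and are not the recipe used for this quant. It also ignores the per-group scale overhead that real quantized formats add on top of the nominal bits.

```python
import re

# Hypothetical rules: (regex on the layer path, bits). First match wins.
# 16 stands for "leave in BF16". Patterns and bit choices are illustrative.
RULES = [
    (re.compile(r"\.mlp\.gate$"), 16),   # MoE router: most sensitive
    (re.compile(r"lm_head$"), 8),        # output embeddings
    (re.compile(r"\.self_attn\."), 8),   # attention projections
    (re.compile(r"\.experts\."), 4),     # routed experts: most tolerant
]
DEFAULT_BITS = 4


def bits_for(layer_path: str) -> int:
    """Return the bit width for a layer, falling back to the 4-bit baseline."""
    for pattern, bits in RULES:
        if pattern.search(layer_path):
            return bits
    return DEFAULT_BITS


def average_bits(layer_sizes: dict[str, int]) -> float:
    """Parameter-weighted mean bits over {layer_path: parameter_count}."""
    total = sum(layer_sizes.values())
    return sum(bits_for(path) * n for path, n in layer_sizes.items()) / total


# Toy layer inventory with made-up parameter counts.
toy_layers = {
    "model.layers.0.mlp.gate": 1_000,
    "model.layers.0.self_attn.q_proj": 10_000,
    "model.layers.0.mlp.experts.0.up_proj": 100_000,
    "lm_head": 20_000,
}
print(average_bits(toy_layers))
```

Because the experts hold most of the parameters, keeping them at 4-bit while raising the small sensitive layers moves the average only modestly above the baseline.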
Base model: Qwen/Qwen3.6-35B-A3B