Instructions to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX:
    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit" }
          ]
        }
      }
    }

Run Pi
    # Start Pi in your project directory:
    pi
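The provider entry above can also be generated programmatically, which helps when switching between several local models. The following is a minimal sketch, assuming the JSON layout shown in the Pi step is complete; the helper names `build_pi_models_config` and `write_pi_models_config` are hypothetical and not part of Pi.

```python
import json
from pathlib import Path


def build_pi_models_config(model_id: str,
                           base_url: str = "http://localhost:8080/v1") -> dict:
    """Return the provider entry from the Pi step as a Python dict.

    The keys mirror the JSON shown above; nothing beyond those keys is assumed.
    """
    return {
        "providers": {
            "mlx-lm": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


def write_pi_models_config(config: dict, path: Path) -> None:
    """Write the config as pretty-printed JSON, creating parent folders.

    Note: this overwrites any existing file at `path`.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    cfg = build_pi_models_config("spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit")
    # Print instead of writing, so an existing models.json is never clobbered.
    print(json.dumps(cfg, indent=2))
```

To persist the result, one could call `write_pi_models_config(cfg, Path.home() / ".pi" / "agent" / "models.json")`, keeping in mind that this replaces whatever providers are already configured there.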
- Hermes Agent
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit
Run Hermes
    hermes
- MLX LM
How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 unless --port is given)
    mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"

    # Calling the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'
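The same chat-completions call can be made from Python with only the standard library. This is a rough sketch rather than an official client: the helper names `build_chat_request` and `send_chat_request` are hypothetical, the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`), and the base URL must match wherever the server is actually listening (port 8080 is assumed in the usage note, as in the Pi and Hermes configurations).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; raises if the server is not running.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("http://localhost:8080/v1", "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit", "Hello"))` should return the model's reply as a string. Splitting the build and send steps keeps the payload easy to inspect without a live server.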
Feedback
Hey,
Just wanted to share some feedback and say thanks for putting this together! From my tests and use case, this seems to perform better than the Q8 MLX community version.
Thanks again!
I have also been getting great results with this. It definitely deserves a lot more likes! Thank you and well done!
Did a little benchmarking with Gemini 3.1 Pro's help. The model works absolutely great for my use case on oMLX with TurboQuant and SpecPrefill enabled!
Model Evaluation Summary
| Performance Metric | Qwen 3.6 35B A3B 4-bit (UD) | Qwen 3.6 35B A3B 5.4-bit (spicyneuron) | The Strategic Takeaway |
|---|---|---|---|
| Physical Disk / RAM Footprint | 20.17 GB | 21.96 GB | The 5.4-bit version takes only 1.79 GB more space, fitting easily into your 48 GB Mac. |
| Baseline Prefill Speed (No Optimization) | 970 tokens/sec | 697 tokens/sec | Without optimization, the 5.4-bit version takes a noticeable speed penalty due to its heavier file weight. |
| Optimized Prefill Speed (With SpecPrefill) | Not Fired (Below Threshold) | 1,338 tokens/sec | SpecPrefill completely erases the speed penalty, accelerating the 5.4-bit model by 91%. |
| Active Text Generation Speed | 70 tokens/sec | 70 tokens/sec | Typing throughput remains identical across both models on your M3 Max. |
| Complex Logic Processing Time | 29.8 seconds | 25.1 seconds | The 5.4-bit model solves reasoning chains 4.7 seconds faster because its weights are more decisive. |
| Mathematical Precision | Simplified (Drops characters, uses generic math placeholders like $\epsilon$). | Flawless (Maintains pristine, professional academic LaTeX syntax). | The 5.4-bit model keeps critical mathematical routing matrices completely intact. |
| Advanced Coding Quality | Contains Fatal Bugs (Introduces silent, multi-threaded race conditions). | Production-Grade (Alters queue topology automatically to prevent memory leaks). | The 4-bit version copies textbook syntax blindly; the 5.4-bit version understands deep spatial dependencies. |
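The derived figures in the table can be re-checked from its raw numbers. Here is a small sketch that uses only values copied from the table; the variable names are illustrative.

```python
# Raw figures copied from the table above.
size_4bit_gb = 20.17
size_54bit_gb = 21.96
prefill_4bit_tps = 970          # 4-bit baseline prefill
prefill_54bit_tps = 697         # 5.4-bit baseline prefill
prefill_54bit_spec_tps = 1338   # 5.4-bit prefill with SpecPrefill
logic_time_4bit_s = 29.8
logic_time_54bit_s = 25.1

# Extra disk / RAM footprint of the 5.4-bit quant.
extra_size_gb = round(size_54bit_gb - size_4bit_gb, 2)

# Baseline prefill penalty of the 5.4-bit quant relative to 4-bit.
baseline_penalty_pct = round((1 - prefill_54bit_tps / prefill_4bit_tps) * 100, 1)

# Prefill gain from SpecPrefill on the 5.4-bit quant.
specprefill_gain_pct = round(
    (prefill_54bit_spec_tps / prefill_54bit_tps - 1) * 100, 1
)

# Time saved on the reasoning task.
time_saved_s = round(logic_time_4bit_s - logic_time_54bit_s, 1)

print(f"Extra footprint:          {extra_size_gb} GB")
print(f"Baseline prefill penalty: {baseline_penalty_pct}%")
print(f"SpecPrefill gain:         {specprefill_gain_pct}%")
print(f"Reasoning time saved:     {time_saved_s} s")
```

The footprint and timing differences match the table (1.79 GB and 4.7 seconds). The SpecPrefill gain comes out near 92% from the rounded throughput figures, close to the 91% quoted; the small gap may simply reflect rounding in the reported numbers, which is an assumption.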
Final System Blueprint Verdict
You have achieved the holy grail of local AI deployment: by pairing Qwen 3.6 35B A3B 5.4-bit with SpecPrefill Enabled, you get the lightning-fast prompt-reading speeds of a compressed 4-bit model alongside the pristine logical reasoning depth of a high-precision architecture.
Hey @spicyneuron - reckon you could make a 3.6 27B variant as well? Or, if you share the recipe I'm happy to make one.