Instructions to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
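The snippet above sends a single user turn. As a minimal sketch, the same `messages` structure can also carry a system prompt and earlier turns. `build_messages` and `run_story` are hypothetical helpers, not part of mlx-lm, and the system prompt text is illustrative. The sketch assumes the model's chat template accepts the usual `system`, `user`, and `assistant` roles.

```python
from typing import Optional


def build_messages(
    prompt: str,
    system: Optional[str] = None,
    history: Optional[list] = None,
) -> list:
    """Hypothetical helper (not part of mlx-lm): assemble a chat message list."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    # Earlier turns, each a {"role": ..., "content": ...} dict, in order.
    messages.extend(history or [])
    messages.append({"role": "user", "content": prompt})
    return messages


def run_story() -> str:
    """Generate with the model; needs mlx-lm installed on Apple silicon."""
    from mlx_lm import load, generate  # imported lazily so the helper works anywhere

    model, tokenizer = load("spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit")
    messages = build_messages(
        "Write a story about Einstein",
        system="You are a concise storyteller.",  # illustrative system prompt
    )
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return generate(model, tokenizer, prompt=prompt, verbose=True)
```

Calling `run_story()` downloads and loads the model on first use, so expect a delay before any text appears.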

Local Apps
Pi

How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
        }
      ]
    }
  }
}
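If `~/.pi/agent/models.json` already lists other providers, pasting the JSON above over the file would drop them. The following is a rough sketch of merging the entry instead. `merge_provider` and `write_models_json` are hypothetical helpers, and the sketch assumes the file keeps everything under a top-level `providers` object, as the fragment above suggests.

```python
import json
from pathlib import Path

MODEL_ID = "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"


def merge_provider(
    config: dict,
    model_id: str = MODEL_ID,
    base_url: str = "http://localhost:8080/v1",
) -> dict:
    """Return a copy of config with the "mlx-lm" provider added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    merged["providers"] = providers
    return merged


def write_models_json(path: Path) -> None:
    """Merge the provider into the file at path, creating the file if needed."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_provider(existing), indent=2) + "\n")
```

Usage would be `write_models_json(Path.home() / ".pi" / "agent" / "models.json")`. Other providers already in the file are left in place.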

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit
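Before launching the agent, it can help to confirm that the server is reachable at the configured base URL. The sketch below assumes the server answers `GET /v1/models` with an OpenAI-style model list (`{"data": [{"id": ...}]}`). That response shape is an assumption, and `extract_model_ids` and `list_server_models` are hypothetical helpers.

```python
import json
import urllib.request

# Same value passed to `hermes config set model.base_url`.
BASE_URL = "http://127.0.0.1:8080/v1"


def extract_model_ids(payload: dict) -> list:
    """Pull model ids out of an OpenAI-style model list ({"data": [{"id": ...}]})."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_server_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """Query GET {base_url}/models; raises URLError if the server is unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

If `list_server_models()` raises a connection error, the server is probably not running or is listening on a different port.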

Run Hermes

hermes

MLX LM

How to use spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"
# Call the OpenAI-compatible server with curl from a second terminal
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
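The same request can be sent from Python with only the standard library. This sketch assumes the server is listening on `localhost:8080`, as in the Pi and Hermes configurations (adjust `BASE_URL` if it was started on another port). It also assumes the reply follows the OpenAI chat-completions shape. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: default host and port
MODEL_ID = "spicyneuron/Qwen3.6-35B-A3B-MLX-5.4bit"


def build_chat_request(
    content: str, base_url: str = BASE_URL, model: str = MODEL_ID
) -> urllib.request.Request:
    """Build the POST request for /chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.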

Feedback

by Alexandru-V - opened Apr 20

Discussion

Alexandru-V

Apr 20

Hey,

Just wanted to share some feedback and say thanks for putting this together! From my tests and use case, this seems to perform better than the Q8 quant from the MLX community.

Thanks again!

DarkHorse2305

17 days ago

I have also been getting great results with this. It definitely deserves a lot more likes! Thank you and well done!

DarkHorse2305

15 days ago

Did a little benchmarking with Gemini 3.1 Pro's help. The model works absolutely great for my use case on oMLX with TurboQuant and SpecPrefill enabled!

📊 Model Evaluation Summary

Performance Metric	Qwen 3.6 35B A3B 4-bit (UD)	Qwen 3.6 35B A3B 5.4-bit (spicyneuron)	The Strategic Takeaway
Physical Disk / RAM Footprint	20.17 GB	21.96 GB	The 5.4-bit version takes only 1.79 GB more space, fitting easily into your 48 GB Mac.
Baseline Prefill Speed (No Optimization)	970 tokens/sec	697 tokens/sec	Without optimization, the 5.4-bit version takes a noticeable speed penalty due to its heavier file weight.
Optimized Prefill Speed (With SpecPrefill)	Not Fired (Below Threshold)	🚀 1,338 tokens/sec	SpecPrefill completely erases the speed penalty, accelerating the 5.4-bit model by 91%.
Active Text Generation Speed	70 tokens/sec	70 tokens/sec	Typing throughput remains identical across both models on your M3 Max.
Complex Logic Processing Time	29.8 seconds	⏱️ 25.1 seconds	The 5.4-bit model solves reasoning chains 4.7 seconds faster because its weights are more decisive.
Mathematical Precision	Simplified (Drops characters, uses generic math placeholders like $\epsilon$).	🧠 Flawless (Maintains pristine, professional academic LaTeX syntax).	The 5.4-bit model keeps critical mathematical routing matrices completely intact.
Advanced Coding Quality	❌ Contains Fatal Bugs (Introduces silent, multi-threaded race conditions).	Production-Grade (Alters queue topology automatically to prevent memory leaks).	The 4-bit version copies textbook syntax blindly; the 5.4-bit version understands deep spatial dependencies.
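The derived figures in the table can be re-checked from the raw numbers it reports. This short sketch uses only the values shown in the table, and the variable names are illustrative. The rounded inputs give a prefill gain of roughly 92%, close to the 91% quoted, which presumably came from unrounded measurements.

```python
# Figures copied from the table above.
size_4bit_gb, size_54bit_gb = 20.17, 21.96
prefill_4bit_tps = 970          # 4-bit baseline prefill, tokens/sec
prefill_baseline_tps = 697      # 5.4-bit without SpecPrefill
prefill_spec_tps = 1338         # 5.4-bit with SpecPrefill
logic_4bit_s, logic_54bit_s = 29.8, 25.1

# Extra disk/RAM footprint of the 5.4-bit model.
size_delta_gb = round(size_54bit_gb - size_4bit_gb, 2)

# Time saved on the reasoning task.
logic_delta_s = round(logic_4bit_s - logic_54bit_s, 1)

# Relative prefill gain from SpecPrefill on the 5.4-bit model.
prefill_gain_pct = (prefill_spec_tps / prefill_baseline_tps - 1) * 100

# Optimized 5.4-bit prefill relative to the unoptimized 4-bit prefill.
vs_4bit_pct = (prefill_spec_tps / prefill_4bit_tps - 1) * 100

print(f"size delta: {size_delta_gb} GB")
print(f"logic time saved: {logic_delta_s} s")
print(f"SpecPrefill gain: {prefill_gain_pct:.1f}%")
print(f"vs 4-bit baseline prefill: {vs_4bit_pct:.1f}%")
```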

🏁 Final System Blueprint Verdict

You have achieved the holy grail of local AI deployment: by pairing Qwen 3.6 35B A3B 5.4-bit with SpecPrefill Enabled, you get the lightning-fast prompt-reading speeds of a compressed 4-bit model alongside the pristine logical reasoning depth of a high-precision architecture.

Alexandru-V

15 days ago

Hey @spicyneuron - reckon you could make a 3.6 27B variant as well? Or, if you share the recipe I'm happy to make one.

spicyneuron

Owner 12 days ago

@Alexandru-V Done! https://huggingface.co/spicyneuron/Qwen3.6-27B-MLX-5.7bit-vision

Alexandru-V

12 days ago

@spicyneuron thank you, much appreciated!
