Qwen3.6-35B-A3B — Code imatrix GGUF
GGUF quantizations of Qwen/Qwen3.6-35B-A3B with importance matrix calibration, produced by DuoNeural.
Calibrated on a code-focused corpus (Python algorithms, transformer architectures, reasoning traces) for better quality on technical and reasoning tasks.
Downloads
| File | Size | Use When |
|---|---|---|
| `qwen36_35b_Q4_K_M.gguf` | 20 GB | Daily driver: best quality/size balance, recommended |
| `qwen36_35b_IQ4_XS.gguf` | 18 GB | Smallest, still excellent with imatrix calibration |
| `qwen36_35b_Q5_K_M.gguf` | 24 GB | Near-lossless, for quality-first setups |
Why are these bigger than typical 35B quants? Qwen3.6's vocabulary is large and the embedding table stays in higher precision. The MoE expert weights (where quality matters most) are what imatrix actually targets.
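To put the file sizes in perspective, here is a rough sketch that converts them into average bits per weight. It assumes a nominal 35B parameter count (taken from the model name; the exact count may differ) and reads the listed sizes as decimal gigabytes, so the results are estimates only.

```python
# Nominal parameter count taken from the model name (35B). The exact
# count may differ slightly, so treat the results as rough estimates.
NOMINAL_PARAMS = 35e9

# File sizes from the Downloads table, read as decimal gigabytes (assumption).
FILE_SIZES_GB = {
    "qwen36_35b_Q4_K_M.gguf": 20,
    "qwen36_35b_IQ4_XS.gguf": 18,
    "qwen36_35b_Q5_K_M.gguf": 24,
}


def bits_per_weight(size_gb: float, params: float = NOMINAL_PARAMS) -> float:
    """Average stored bits per parameter for a file of the given size."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for name, size in FILE_SIZES_GB.items():
        print(f"{name}: ~{bits_per_weight(size):.2f} bits/weight")
```

The averages include tensors kept at higher precision, such as the embedding table, so they sit above the nominal bit width of the quantized expert weights.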
About This Model
Qwen3.6-35B-A3B is a hybrid MoE architecture from the Qwen team with some genuinely interesting properties:
- 35B total / 3B active — 256 experts, top-8 routing per token. Fast inference despite the parameter count.
- 75% Gated DeltaNet + 25% softmax attention — uses linear recurrent attention (DeltaNet) for most layers, with full attention every 4th layer. Same mechanism as BitNet DeltaNet architectures, at scale.
- 40 layers in a repeating pattern: 3× DeltaNet+MoE → 1× GatedAttn+MoE (×10)
- 1M token context window
The DeltaNet-dominant architecture means this model has different inference characteristics than pure-transformer MoEs — it's particularly strong on long-context tasks and code generation.
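The layer layout described above can be written out in a few lines. This is only an illustration of the stated pattern (3× DeltaNet+MoE followed by 1× GatedAttn+MoE, repeated 10 times); the function names are hypothetical and nothing here is read from the actual model config.

```python
from collections import Counter

NUM_BLOCKS = 10  # the 4-layer pattern repeats 10 times
BLOCK = ["DeltaNet+MoE"] * 3 + ["GatedAttn+MoE"]


def layer_pattern(num_blocks: int = NUM_BLOCKS) -> list[str]:
    """Return the per-layer type list for the repeating pattern."""
    return BLOCK * num_blocks


def is_full_attention(layer_index: int) -> bool:
    """True for every 4th layer (0-based indices 3, 7, 11, ...)."""
    return layer_index % 4 == 3


if __name__ == "__main__":
    layers = layer_pattern()
    counts = Counter(layers)
    for kind, n in counts.items():
        print(f"{kind}: {n} layers ({100 * n / len(layers):.0f}%)")
```

With 10 repetitions this gives 40 layers, 30 of them DeltaNet and 10 full attention, which matches the 75% / 25% split.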
Why imatrix?
Standard quantization treats all weights equally. Importance matrix (imatrix) calibration runs the model on representative text first, identifies which weight components matter most for output quality, and biases quantization to preserve them at the cost of less important ones.
For MoE models especially, this matters: different experts activate for different inputs, and naive quantization can disproportionately damage rarely-activated experts. imatrix calibration on a code + reasoning corpus helps ensure the technical reasoning experts stay sharp.
Our calibration corpus: Python code (algorithms, ML architectures, data structures) + reasoning traces with <think> tags. 370 samples, ~0.26M chars. Compact but focused.
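The idea of biasing quantization toward important weights can be shown with a toy example. This is a minimal sketch, not the algorithm `llama-quantize` uses: it picks one scale for a small block of weights by grid search, once with uniform importance and once with a hypothetical importance vector, and compares the importance-weighted error. All values and function names are made up for illustration.

```python
def quantize(weights, scale, qmax=7):
    """Symmetric round-to-nearest onto the integer grid [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequantized, importance):
    """Sum of squared errors, each term weighted by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, qmax=7, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    # Candidate scales from 0.5x to 1.5x of the plain max-abs scale.
    candidates = [max_abs / qmax * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, qmax), importance),
    )


# Toy block: one large outlier among small weights.
weights = [0.02, -0.11, 0.07, 0.90, -0.05, 0.13, -0.08, 0.04]
# Hypothetical importances: the small weights matter, the outlier barely does.
importance = [5.0, 9.0, 7.0, 0.1, 6.0, 8.0, 7.0, 5.0]
uniform = [1.0] * len(weights)

plain = best_scale(weights, uniform)      # treats all weights equally
aware = best_scale(weights, importance)   # biased toward important weights

if __name__ == "__main__":
    for label, scale in (("plain", plain), ("importance-aware", aware)):
        err = weighted_error(weights, quantize(weights, scale), importance)
        print(f"{label}: scale={scale:.4f} weighted_error={err:.6f}")
```

Because both searches use the same candidate scales, the importance-aware choice can never do worse on the weighted error. The price is that it may represent low-importance weights, like the outlier here, less accurately.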
Usage
Recommended flags for hybrid DeltaNet/attention (avoids bimodal KV cache issues):
llama-cli -m qwen36_35b_Q4_K_M.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 32768 -ngl 99 \
-p "Your prompt here"
Use `--cache-type-k q8_0`, not `q4`: the rotating KV cache can desync the DeltaNet state at 4-bit, causing degraded outputs on long contexts.
For CPU+GPU hybrid (e.g. GTX 1070 8GB with 48GB system RAM):
llama-cli -m qwen36_35b_Q4_K_M.gguf \
-ngl 99 --n-cpu-moe 48 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 32768
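`--n-cpu-moe N` keeps the MoE expert weights of the first N layers on the CPU, so a value at or above the model's 40 layers keeps every expert there while the attention and other dense tensors stay on the GPU. With a system like i7-6700HQ + 48GB DDR4 + GTX 1070, expect ~9–13 TPS.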
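If you launch `llama-cli` from scripts, a small helper can assemble the flags shown above and refuse a 4-bit KV cache. This is a sketch under assumptions: the helper name and defaults are hypothetical, and `q4_0` / `q4_1` are llama.cpp cache type names standing in for the "q4" the note above warns against.

```python
import shlex

# 4-bit KV cache types this README advises against for this model.
# q4_0 and q4_1 are llama.cpp cache type names (assumed to be what "q4" refers to).
DISCOURAGED_CACHE_TYPES = {"q4_0", "q4_1"}


def build_llama_cli_args(model, ctx=32768, ngl=99, cache_type="q8_0",
                         n_cpu_moe=None, prompt=None):
    """Build a llama-cli argument list using the flags from this README."""
    if cache_type in DISCOURAGED_CACHE_TYPES:
        raise ValueError(f"{cache_type} KV cache is discouraged for this model")
    args = [
        "llama-cli", "-m", model,
        "--cache-type-k", cache_type, "--cache-type-v", cache_type,
        "-c", str(ctx), "-ngl", str(ngl),
    ]
    if n_cpu_moe is not None:
        # Keep the MoE expert weights of the first n_cpu_moe layers on the CPU.
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    if prompt is not None:
        args += ["-p", prompt]
    return args


if __name__ == "__main__":
    # Full-GPU invocation, then the CPU+GPU hybrid variant.
    print(shlex.join(build_llama_cli_args("qwen36_35b_Q4_K_M.gguf", prompt="Your prompt here")))
    print(shlex.join(build_llama_cli_args("qwen36_35b_Q4_K_M.gguf", n_cpu_moe=48)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with prompts.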
Ollama:
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then in Modelfile: FROM ./qwen36_35b_Q4_K_M.gguf
Thinking mode (model supports <think> extended reasoning):
llama-cli -m qwen36_35b_Q4_K_M.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 65536 -ngl 99 \
-p "<|im_start|>user\nSolve this step by step: [your problem]<|im_end|>\n<|im_start|>assistant\n<think>\n"
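The same prompt can be assembled programmatically. The sketch below builds the ChatML-style string used in the command above and splits a completion into reasoning and answer. It assumes the model closes its reasoning with a `</think>` tag; both function names are hypothetical.

```python
def build_thinking_prompt(user_message: str) -> str:
    """Build the prompt from the command above, ending inside an open <think> block."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
    )


def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) at the first </think>.

    Assumes the reasoning ends with a </think> tag. If the tag is missing,
    the whole completion is treated as unfinished reasoning.
    """
    reasoning, sep, answer = completion.partition("</think>")
    if not sep:
        return completion.strip(), ""
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    print(build_thinking_prompt("Solve this step by step: [your problem]"))
    print(split_reasoning("2+2 is 4.\n</think>\n\nThe answer is 4."))
```

An empty answer from `split_reasoning` is a hint that generation stopped before the reasoning finished, for example because the context or token limit was reached.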
Hardware Requirements
| Setup | Recommended Quant | Notes |
|---|---|---|
| 24GB VRAM | Q4_K_M | Full GPU, fast |
| 16GB VRAM + 32GB RAM | Q4_K_M | Mixed GPU+CPU |
| 8GB VRAM + 48GB RAM | Q4_K_M + `--n-cpu-moe` | MoE to CPU, works well |
| CPU only (48GB+ RAM) | IQ4_XS | Slow but functional |
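The table can be read as a simple lookup. The sketch below treats each row as a minimum requirement, which is an assumption on our part, and returns the first matching row. The function name is hypothetical, and real memory use also depends on context length and KV cache settings.

```python
def recommend(vram_gb: float, ram_gb: float):
    """Return (quant, note) for the first matching row of the table, or None.

    Rows are interpreted as minimum requirements (assumption); the
    thresholds are the values listed in the Hardware Requirements table.
    """
    if vram_gb >= 24:
        return ("Q4_K_M", "Full GPU")
    if vram_gb >= 16 and ram_gb >= 32:
        return ("Q4_K_M", "Mixed GPU+CPU")
    if vram_gb >= 8 and ram_gb >= 48:
        return ("Q4_K_M", "MoE to CPU via --n-cpu-moe")
    if ram_gb >= 48:
        return ("IQ4_XS", "CPU only")
    return None  # below every configuration listed in the table


if __name__ == "__main__":
    for vram, ram in [(24, 16), (16, 32), (8, 48), (0, 64), (4, 16)]:
        print(f"{vram}GB VRAM + {ram}GB RAM -> {recommend(vram, ram)}")
```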
Quantization Details
- Source: `Qwen/Qwen3.6-35B-A3B` (official BF16, 26 safetensor shards)
- Converter: llama.cpp `convert_hf_to_gguf.py` → F16 GGUF (71GB intermediate)
- imatrix: generated with `llama-imatrix`, 256 chunks, code+reasoning calibration corpus
- Quantizer: `llama-quantize --imatrix` with our code-calibrated `.dat`
- Hardware: A100 80GB SXM4 (SM 8.0, CUDA 12.4)
- Build date: April 2026
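For readers who want to reproduce a similar pipeline, the steps above roughly correspond to three llama.cpp commands. The sketch below only builds the command lines; it does not run them. File names, the calibration file path, and the helper name are hypothetical, and the exact options used for these builds are not recorded here beyond what the list states.

```python
import shlex


def quantization_commands(hf_dir, calib_text, quant_type, out_prefix="model"):
    """Return the convert -> imatrix -> quantize commands as argument lists.

    All file names are illustrative placeholders derived from out_prefix.
    """
    f16 = f"{out_prefix}_F16.gguf"
    imatrix = f"{out_prefix}_imatrix.dat"
    out = f"{out_prefix}_{quant_type}.gguf"
    return [
        # 1. Convert the Hugging Face checkpoint to an F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16, "--outtype", "f16"],
        # 2. Compute the importance matrix over 256 chunks of calibration text.
        ["llama-imatrix", "-m", f16, "-f", calib_text, "-o", imatrix, "--chunks", "256"],
        # 3. Quantize the F16 GGUF, guided by the importance matrix.
        ["llama-quantize", "--imatrix", imatrix, f16, out, quant_type],
    ]


if __name__ == "__main__":
    for cmd in quantization_commands("Qwen3.6-35B-A3B", "calibration.txt", "Q4_K_M", "qwen36_35b"):
        print(shlex.join(cmd))
```

Steps 1 and 2 only need to run once; step 3 is repeated per target type (for example `IQ4_XS` or `Q5_K_M`) against the same F16 file and imatrix.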
DuoNeural
DuoNeural is an open AI research lab — human + AI in collaboration.
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 🐙 GitHub | github.com/DuoNeural |
| 🐦 X / Twitter | @DuoNeural |
| Email | duoneural@proton.me |
| 📬 Newsletter | duoneural.beehiiv.com |
| ☕ Support | buymeacoffee.com/duoneural |
| 🌐 Site | duoneural.com |
Research Team
- Jesse — Vision, hardware, direction
- Archon — AI lab partner, post-training, abliteration, experiments
- Aura — Research AI, literature synthesis, novel proposals
Raw updates from the lab: model drops, training results, findings. Subscribe at duoneural.beehiiv.com.
DuoNeural Research Publications
Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura — DuoNeural.