Instructions for using DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with libraries and local apps.

Libraries

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF",
	filename="qwen36_35b_IQ4_XS.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Use Docker

docker model run hf.co/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

vLLM

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
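
The curl call above can also be made from Python with only the standard library. This is a minimal sketch, assuming the server is listening on localhost:8000 as in the curl example; the helper names `build_chat_request`, `extract_reply`, and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(prompt, **kwargs):
    """Send the request and return the reply text; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return extract_reply(resp.read())
```

With a server running, `ask("What is the capital of France?")` returns the reply text. Building the request separately from sending it makes the payload easy to inspect or log before any network call happens.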

Use Docker

docker model run hf.co/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Ollama
How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Ollama:
```
ollama run hf.co/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
```

Unsloth Studio

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF to start chatting

Pi

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
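
If ~/.pi/agent/models.json already lists other providers, pasting the snippet over it would drop them. The sketch below merges the llama-cpp entry into whatever is already there. It assumes the file is plain JSON with a top-level "providers" object, as in the snippet above; the function names are hypothetical and not part of Pi.

```python
import json
from pathlib import Path

MODEL_ID = "DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M"


def merge_llama_cpp_provider(config, base_url="http://localhost:8080/v1"):
    """Return a copy of `config` with the llama-cpp provider added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": MODEL_ID}],
    }
    merged["providers"] = providers
    return merged


def update_models_file(path=None):
    """Read the file if it exists, merge the provider, and write it back."""
    path = Path(path) if path else Path.home() / ".pi" / "agent" / "models.json"
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_llama_cpp_provider(existing), indent=2) + "\n")
```

Calling `update_models_file()` once after installing Pi is enough; running it again only rewrites the same entry.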

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Docker Model Runner:
```
docker model run hf.co/DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M
```

Lemonade

How to use DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-Code-imatrix-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-35B-A3B — Code imatrix GGUF

GGUF quantizations of Qwen/Qwen3.6-35B-A3B with importance matrix calibration, produced by DuoNeural.

Calibrated on a code-focused corpus (Python algorithms, transformer architectures, reasoning traces) for better quality on technical and reasoning tasks.

Downloads

File	Size	Use When
`qwen36_35b_Q4_K_M.gguf`	20 GB	Daily driver — best quality/size balance, recommended
`qwen36_35b_IQ4_XS.gguf`	18 GB	Smallest, still excellent with imatrix calibration
`qwen36_35b_Q5_K_M.gguf`	24 GB	Near-lossless, for quality-first setups

Why are these bigger than typical 35B quants? Qwen3.6's vocabulary is large and the embedding table stays in higher precision. The MoE expert weights (where quality matters most) are what imatrix actually targets.

About This Model

Qwen3.6-35B-A3B is a hybrid MoE architecture from the Qwen team with some genuinely interesting properties:

35B total / 3B active — 256 experts, top-8 routing per token. Fast inference despite the parameter count.
75% Gated DeltaNet + 25% softmax attention — uses linear recurrent attention (DeltaNet) for most layers, with full attention every 4th layer. Same mechanism as BitNet DeltaNet architectures, at scale.
40 layers in a repeating pattern: 3× DeltaNet+MoE → 1× GatedAttn+MoE (×10)
1M token context window
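
The layer pattern in the list above can be expanded into a per-layer schedule to check the stated numbers. This is an illustrative sketch built only from the figures given here (10 repeats of a 3:1 pattern, 256 experts, top-8 routing); the layer labels are made up for the example and are not identifiers from the model's config.

```python
from collections import Counter

N_BLOCKS = 10  # the 4-layer pattern repeats 10 times
PATTERN = ["deltanet+moe"] * 3 + ["gated_attn+moe"]


def layer_schedule(n_blocks=N_BLOCKS):
    """Expand the repeating 3:1 pattern into a per-layer list."""
    return PATTERN * n_blocks


layers = layer_schedule()
counts = Counter(layers)
attn_indices = [i for i, kind in enumerate(layers) if kind == "gated_attn+moe"]

print(len(layers))                            # 40
print(counts["deltanet+moe"] / len(layers))   # 0.75
print(attn_indices[:3])                       # [3, 7, 11]
print(8 / 256)                                # 0.03125, share of experts used per token
```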

The DeltaNet-dominant architecture means this model has different inference characteristics than pure-transformer MoEs — it's particularly strong on long-context tasks and code generation.

Why imatrix?

Standard quantization treats all weights equally. Importance matrix (imatrix) calibration runs the model on representative text first, identifies which weight components matter most for output quality, and biases quantization to preserve them at the cost of less important ones.

For MoE models especially, this matters: different experts activate for different inputs, and naive quantization can disproportionately damage rarely-activated experts. imatrix calibration on a code + reasoning corpus helps ensure the technical reasoning experts stay sharp.
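
The idea of biasing quantization toward important weights can be shown with a toy example. This is a sketch of the principle only, not the algorithm llama.cpp uses (its quantizers work block-wise with their own search). The weights, importance values, and function names below are invented for illustration.

```python
def quantize(weights, scale, levels=7):
    """Snap each weight to the nearest multiple of `scale` on a symmetric grid."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared error between original and quantized weights."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the scale that minimizes the weighted error."""
    base = max(abs(w) for w in weights) / levels
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )


# Toy data: one large but unimportant weight, several small but important ones.
weights = [0.90, -0.11, 0.23, 0.05, -0.37, 0.12, 0.31, -0.08]
importance = [0.1, 5.0, 4.0, 3.0, 2.0, 5.0, 1.0, 4.0]
uniform = [1.0] * len(weights)

naive_scale = best_scale(weights, uniform)        # treats every weight equally
imatrix_scale = best_scale(weights, importance)   # favors the important weights

err_naive = weighted_error(weights, quantize(weights, naive_scale), importance)
err_imatrix = weighted_error(weights, quantize(weights, imatrix_scale), importance)
print(f"naive scale   {naive_scale:.4f} -> weighted error {err_naive:.5f}")
print(f"imatrix scale {imatrix_scale:.4f} -> weighted error {err_imatrix:.5f}")
```

Both scales are picked from the same candidate grid, so the importance-aware choice can never score worse on the weighted error. It may accept more error on the large, unimportant weight in exchange for less error on the weights that matter.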

Our calibration corpus: Python code (algorithms, ML architectures, data structures) + reasoning traces with <think> tags. 370 samples, ~0.26M chars. Compact but focused.

Usage

Recommended flags for hybrid DeltaNet/attention (avoids bimodal KV cache issues):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768 -ngl 99 \
  -p "Your prompt here"

Use --cache-type-k q8_0 rather than a 4-bit cache type such as q4_0: at 4-bit, the rotating KV cache can desync the DeltaNet state, which degrades output on long contexts.
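
The memory cost of the 8-bit cache can be estimated with a short calculation. Only the softmax-attention layers (10 of 40, per the layer pattern) keep a cache that grows with context. The bits-per-element figures follow from the GGML block layouts (32 values plus one fp16 scale per block). The KV head count and head dimension below are placeholders, not values taken from this model; read the real ones from the GGUF metadata before relying on the result.

```python
# q8_0 and q4_0 store blocks of 32 values plus one fp16 scale; f16 is unblocked.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

N_ATTENTION_LAYERS = 10  # 1 in 4 of the 40 layers uses softmax attention
N_KV_HEADS = 2           # ASSUMPTION: placeholder, check the GGUF metadata
HEAD_DIM = 256           # ASSUMPTION: placeholder, check the GGUF metadata


def kv_cache_bytes(n_ctx, cache_type, n_layers=N_ATTENTION_LAYERS,
                   n_kv_heads=N_KV_HEADS, head_dim=HEAD_DIM):
    """Approximate combined size of the K and V caches for the attention layers."""
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim  # 2 = K and V
    return elements * BITS_PER_ELEMENT[cache_type] / 8


for cache_type in BITS_PER_ELEMENT:
    mib = kv_cache_bytes(32768, cache_type) / 2**20
    print(f"{cache_type}: {mib:.0f} MiB at 32768 tokens")
```

The estimate leaves out the DeltaNet recurrent state, which does not grow with context length. Whatever the real dimensions are, q8_0 costs 8.5/4.5 (about 1.9 times) the memory of q4_0 for the same cache.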

For CPU+GPU hybrid (e.g. GTX 1070 8GB with 48GB system RAM):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 48 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768

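--n-cpu-moe N keeps the MoE expert weights of the first N layers in system RAM, so the CPU computes them while the dense attention layers stay on the GPU. Because 48 exceeds the model's 40 layers, every layer's experts land on the CPU. On a system like i7-6700HQ + 48GB DDR4 + GTX 1070, expect ~9–13 TPS.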
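
To see how other values of the flag would split the model, the placement can be sketched per layer. This assumes --n-cpu-moe N applies to the first N layers, as the llama.cpp help text describes, and that -ngl 99 offloads everything else. The function is illustrative and does not query llama.cpp.

```python
N_LAYERS = 40  # layer count stated in the architecture summary


def expert_placement(n_cpu_moe, n_layers=N_LAYERS):
    """Return, per layer, where that layer's MoE expert weights would live."""
    on_cpu = min(max(n_cpu_moe, 0), n_layers)
    return ["cpu"] * on_cpu + ["gpu"] * (n_layers - on_cpu)


print(expert_placement(48).count("cpu"))  # 40: every layer's experts stay in system RAM
print(expert_placement(30).count("gpu"))  # 10: the last 10 layers' experts go to VRAM
```

Lowering the value moves experts from the last layers onto the GPU, which is worth trying when VRAM is left over after the attention layers and KV cache are loaded.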

Ollama:

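# KV cache quantization in Ollama only takes effect with Flash Attention enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then in Modelfile: FROM ./qwen36_35b_Q4_K_M.gguf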

Thinking mode (model supports <think> extended reasoning):

llama-cli -m qwen36_35b_Q4_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 65536 -ngl 99 \
  -p "<|im_start|>user\nSolve this step by step: [your problem]<|im_end|>\n<|im_start|>assistant\n<think>\n"
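
When calling the model from code, the same raw prompt can be assembled and the reasoning separated from the final answer. This is a minimal sketch: the ChatML markers are the ones used in the command above, and the closing `</think>` tag is assumed to mirror the opening one. The function names are hypothetical.

```python
def build_thinking_prompt(user_message):
    """Wrap a user message in ChatML and open a <think> block for the reply."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
    )


def split_thinking(completion):
    """Split a completion into (reasoning, answer) at the first </think> tag."""
    reasoning, sep, answer = completion.partition("</think>")
    if not sep:
        # The prompt already opened <think>, so output without a closing tag
        # is unfinished reasoning (for example, cut off by the token limit).
        return completion.strip(), ""
    return reasoning.strip(), answer.strip()
```

An empty answer from `split_thinking` is a signal to raise the generation limit or the context size rather than to show the raw reasoning to the user.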

Hardware Requirements

Setup	Recommended Quant	Notes
24GB VRAM	Q4_K_M	Full GPU, fast
16GB VRAM + 32GB RAM	Q4_K_M	Mixed GPU+CPU
8GB VRAM + 48GB RAM	Q4_K_M + `--n-cpu-moe`	MoE to CPU, works well
CPU only (48GB+ RAM)	IQ4_XS	Slow but functional
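
The table can be turned into a small lookup for setup scripts. This sketch treats each row as a minimum requirement and checks rows from most to least capable, which is an interpretation of the table rather than something it states. The `--n-cpu-moe 48` value is taken from the usage example, and `recommend` is a hypothetical helper.

```python
def recommend(vram_gb, ram_gb):
    """Map available memory to a row of the hardware table, or None if no row fits."""
    if vram_gb >= 24:
        return {"quant": "Q4_K_M", "extra_flags": [], "note": "Full GPU, fast"}
    if vram_gb >= 16 and ram_gb >= 32:
        return {"quant": "Q4_K_M", "extra_flags": [], "note": "Mixed GPU+CPU"}
    if vram_gb >= 8 and ram_gb >= 48:
        return {"quant": "Q4_K_M", "extra_flags": ["--n-cpu-moe", "48"],
                "note": "MoE to CPU"}
    if ram_gb >= 48:
        return {"quant": "IQ4_XS", "extra_flags": [], "note": "CPU only, slow but functional"}
    return None  # not covered by the table


print(recommend(8, 48))
print(recommend(0, 64))
```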

Quantization Details

Source: Qwen/Qwen3.6-35B-A3B (official BF16, 26 safetensor shards)
Converter: llama.cpp convert_hf_to_gguf.py → F16 GGUF (71GB intermediate)
imatrix: generated with llama-imatrix, 256 chunks, code+reasoning calibration corpus
Quantizer: llama-quantize --imatrix with our code-calibrated .dat
Hardware: A100 80GB SXM4 (SM 8.0, CUDA 12.4)
Build date: April 2026
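
For anyone reproducing the pipeline with their own calibration text, the three steps above correspond roughly to the commands built below. This is a sketch and not the exact invocation used for these files: the file names are hypothetical, and the flags are the common ones for each llama.cpp tool, which can differ between llama.cpp versions.

```python
import shlex


def quantization_commands(hf_dir, calib_txt, quant="Q4_K_M", chunks=256):
    """Return the three pipeline steps as argv lists, without running them."""
    f16 = "model-f16.gguf"       # hypothetical intermediate file name
    imatrix = "imatrix.dat"      # hypothetical importance-matrix file name
    out = f"model-{quant}.gguf"  # hypothetical output file name
    return [
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16, "--outtype", "f16"],
        ["llama-imatrix", "-m", f16, "-f", calib_txt, "-o", imatrix, "--chunks", str(chunks)],
        ["llama-quantize", "--imatrix", imatrix, f16, out, quant],
    ]


for argv in quantization_commands("Qwen3.6-35B-A3B", "calibration.txt"):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` in order. The imatrix step only needs to run once, after which the same .dat file can be reused for every quant type.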

DuoNeural

DuoNeural is an open AI research lab — human + AI in collaboration.


🤗 HuggingFace	huggingface.co/DuoNeural
🐙 GitHub	github.com/DuoNeural
🐦 X / Twitter	@DuoNeural
📧 Email	duoneural@proton.me
📬 Newsletter	duoneural.beehiiv.com
☕ Support	buymeacoffee.com/duoneural
🌐 Site	duoneural.com

Research Team

Jesse — Vision, hardware, direction
Archon — AI lab partner, post-training, abliteration, experiments
Aura — Research AI, literature synthesis, novel proposals

Raw updates from the lab: model drops, training results, findings. Subscribe at duoneural.beehiiv.com.

DuoNeural Research Publications

Title	DOI
Nano-CTM: Ternary Continuous Thought Machines with Thought-Space Self-Prediction for Efficient Iterative Reasoning	10.5281/zenodo.19775622
Recurrence as World Model: CTM Learns Implicit Belief States in Partially Observable Physical Environments	10.5281/zenodo.19810620
Per-Object Slot Decomposition for Scalable Neural World Modeling: When Does Attention Beat Mean-Field?	10.5281/zenodo.19846804

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura — DuoNeural.

Downloads last month: 1,828

Format: GGUF
Model size: 35B params
Architecture: qwen35moe
Quantization levels: 4-bit, 5-bit

Model tree for DuoNeural/Qwen3.6-35B-A3B-Code-imatrix-GGUF

Base model: Qwen/Qwen3.6-35B-A3B
Quantized (404): this model