Instructions for using ManiacLabs/Qwen3.6-35B-A3B-2bit with libraries and local apps.

Libraries

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with llama-cpp-python:

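# Install first: pip install llama-cpp-python
from llama_cpp import Llama

# Download the GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="ManiacLabs/Qwen3.6-35B-A3B-2bit",
    filename="qwen3.6-35b-a3b-iq2xxs-q2k.gguf",
)

# Run one chat turn and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])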

llama.cpp

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit
# Run inference directly in the terminal:
llama-cli -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit
# Run inference directly in the terminal:
llama-cli -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit
# Run inference directly in the terminal:
./llama-cli -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Use Docker

docker model run hf.co/ManiacLabs/Qwen3.6-35B-A3B-2bit

vLLM

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ManiacLabs/Qwen3.6-35B-A3B-2bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ManiacLabs/Qwen3.6-35B-A3B-2bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ManiacLabs/Qwen3.6-35B-A3B-2bit

Ollama
How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Ollama:
```
ollama run hf.co/ManiacLabs/Qwen3.6-35B-A3B-2bit
```

Unsloth Studio

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ManiacLabs/Qwen3.6-35B-A3B-2bit to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ManiacLabs/Qwen3.6-35B-A3B-2bit to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ManiacLabs/Qwen3.6-35B-A3B-2bit to start chatting

Pi

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ManiacLabs/Qwen3.6-35B-A3B-2bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ManiacLabs/Qwen3.6-35B-A3B-2bit

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ManiacLabs/Qwen3.6-35B-A3B-2bit

Run Hermes

hermes

Docker Model Runner
How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Docker Model Runner:
```
docker model run hf.co/ManiacLabs/Qwen3.6-35B-A3B-2bit
```

Lemonade

How to use ManiacLabs/Qwen3.6-35B-A3B-2bit with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ManiacLabs/Qwen3.6-35B-A3B-2bit

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-2bit-{{QUANT_TAG}}

List all available models

lemonade list

Qwen3.6-35B-A3B 2-bit GGUF — teacher-level agentic tool use at ~12 GiB

The strongest 2-bit quant of Qwen/Qwen3.6-35B-A3B we know of for agentic work — and not just on one benchmark. On τ²-bench (multi-turn agentic tool use) it matches or beats every model in our cohort, including the bf16 teacher itself, and it beats Unsloth's same-size 2-bit across all three agentic benchmarks we measured: BFCL, τ²-bench, and SWE-bench Verified. One standard GGUF file, mainline-llama.cpp native — runs on a 16 GB Mac or a 12 GB GPU.

Quant	File size	BFCL composite ↑	τ²-bench macro ↑	SWE-bench Verified ↑
Maniac 2-bit (this repo, IQ2_XXS+Q2_K)	11.8 GiB	0.6338	0.686	0.39
Unsloth UD-Q2_K_XL (2-bit)	11.45 GiB	0.600	0.650	0.24
Unsloth UD-Q4_K_XL (4-bit)	20.82 GiB	0.6696	0.651	0.30
bf16 teacher (reference ceiling)	~66 GiB	0.6777	0.647	—

Every column is measured by us, with the same harness and settings for all rows in that column (full protocols below). BFCL: mainline llama.cpp + bfcl-eval, full-N, thinking OFF, temp 0. τ²-bench: the official Sierra harness, all 278 base tasks, fixed user-simulator model. SWE-bench Verified: mini-SWE-agent on the first-100 slice, with the same scaffold and decoding for every row; the one difference is the serving engine (our row via ik_llama, Unsloth rows via mainline llama.cpp). Honest caveats: τ² is single-trial (pass^1), so we read it conservatively as "no measurable regression vs the teacher on multi-turn tool use" rather than a strict win. On absolute BFCL, Unsloth's 4-bit is still better (0.6696 vs 0.6338) at 1.8× the size; this repo wins on quality-per-gigabyte and on fitting in 12–16 GB.

Maniac engine format: Qwen3.6-35B-A3B-2bit-maniac — the same quant repackaged for on-device serving in the Maniac app.

Download

One file, no splits, no sidecars:

Filename	Quant type	File size	Description
qwen3.6-35b-a3b-iq2xxs-q2k.gguf	IQ2_XXS / Q2_K mix (~2.1 bpw)	11.8 GiB	Imatrix-calibrated 2-bit, tuned for tool-calling / agentic use. Fully resident in 12 GiB VRAM or a 16 GB Mac. Recommended.

pip install -U "huggingface_hub[cli]"
huggingface-cli download ManiacLabs/Qwen3.6-35B-A3B-2bit qwen3.6-35b-a3b-iq2xxs-q2k.gguf --local-dir ./

Which file should I choose? There is only one, and that is the point. If you have 12–16 GB of RAM/VRAM and want the best agentic 2-bit Qwen3.6-35B, this is it. If you have 24 GB+, a good 4-bit quant (e.g. Unsloth UD-Q4_K_XL) scores higher on absolute BFCL (0.6696 vs 0.6338) and is worth the extra 9 GiB.

Quickstart

This is a standard GGUF using only mainline ggml quant types (IQ2_XXS, Q2_K) — it loads in stock llama.cpp, ik_llama, LM Studio, and any other llama.cpp-based app. No custom build needed.

llama.cpp (OpenAI-compatible server)

# Tool-use / function-calling (recommended: thinking OFF)
llama-server \
  --model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
  --jinja \
  --reasoning-budget 0 \
  -fa on \
  -ngl 99

# One-shot CLI
llama-cli \
  --model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
  --jinja --reasoning-budget 0 \
  -p "Write a Python function to compute Fibonacci numbers."
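
Once `llama-server` is running with `--jinja`, any OpenAI-style client can send it a tool-calling request. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8080/v1` (the address used in the Pi configuration on this page). The `get_weather` tool is hypothetical and exists only to give the request a tool to offer.

```python
import json
import urllib.request

# Assumption: llama-server is reachable at this address (its usual local default).
BASE_URL = "http://localhost:8080/v1"

# Hypothetical tool, for illustration only; it is not part of this model card.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request that offers one tool."""
    body = {
        "model": "ManiacLabs/Qwen3.6-35B-A3B-2bit",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "temperature": 0,  # greedy decoding, the card's setting for tool use
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_server(prompt: str) -> dict:
    """Send the request and return the first choice's message dict."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]
```

With the server up, `call_server("What is the weather in Paris?")` returns the assistant message. If the model decides to call the tool, the structured call should appear under the message's `tool_calls` key instead of `content`.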

Thinking mode

Qwen3.6 is a hybrid thinking model and defaults to thinking ON:

Tool-use / agents: run thinking OFF (--reasoning-budget 0, or enable_thinking=false in the chat template). All BFCL numbers on this card are measured thinking-OFF.
Hard reasoning tasks: leave thinking ON (omit --reasoning-budget 0).

Recommended sampling

Tool-use / agents (matches our benchmark setup): temperature=0 (greedy)
General chat, non-thinking: temperature=0.7, top_p=0.8, top_k=20 (upstream Qwen guidance)
Thinking mode: temperature=1.0, top_p=0.95, top_k=20 (upstream Qwen guidance)
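
The thinking and sampling recommendations above can be kept in one place on the client side. The sketch below is one possible way to do that, not the setup used for the benchmarks. It assumes the server accepts `top_k` and a `chat_template_kwargs` field in the request body (both are llama-server extensions to the OpenAI schema, so treat them as assumptions for other engines); the mode names are made up for this example.

```python
# Sampling presets copied from the recommendations above.
# The keys "agent", "chat" and "thinking" are labels invented for this sketch.
PRESETS = {
    "agent": {"temperature": 0.0},
    "chat": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20},
}


def request_body(messages: list, mode: str) -> dict:
    """Merge a sampling preset into an OpenAI-style chat-completions body."""
    if mode not in PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    body = {"messages": messages, **PRESETS[mode]}
    # Assumption: the server passes chat_template_kwargs through to the chat
    # template, where enable_thinking switches the reasoning block on or off.
    body["chat_template_kwargs"] = {"enable_thinking": mode == "thinking"}
    return body
```

A server started with `--reasoning-budget 0` already has thinking disabled, so the per-request toggle only matters when the server is left at its default.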

Maniac desktop app

This model is also available on-device via the Maniac app catalog — select Qwen3.6-35B-A3B 2-bit in the local models pane and the app handles download and serving. The app runs a separate format optimized for its own engine (Qwen3.6-35B-A3B-2bit-maniac), not this GGUF file.

Will it run on my machine?

Almost certainly, and not just on Macs. This is a standard GGUF that runs via llama.cpp/ik_llama on any Apple Silicon Mac with 16 GB+ unified memory (M1 and later), on Windows/Linux CPUs with ~13 GB of free RAM, and on any CUDA GPU with 12 GB+ VRAM (the table below lists the RTX 3060 12GB as fully resident), plus partial-offload setups below that.

Hardware	Fits?	Notes
16 GB Apple Silicon Mac (any M1 or later)	✅ fully resident	The headline use case — a 35B-class agentic model on a 16 GB laptop
12 GB GPU (RTX 3060 12GB and up)	✅ fully resident	`-ngl 99`, full GPU offload
8 GB GPU / RAM	⚠️ partial offload	Works with CPU offload, slower
24 GB+ GPU / 32 GB+ Mac	✅	Consider a 4-bit quant instead for max quality
Disk	11.8 GiB	Single file

Decode is fast for the size class: Qwen3.6-35B-A3B is a sparse MoE with only ~3B active parameters per token, so token generation costs roughly what a 3B dense model does at equal bandwidth.

Benchmarks

Everything in this section was measured by us, with the protocol stated inline. The headline comparison (vs Unsloth) uses byte-identical harnesses and settings for every model in a given table; the one exception is the serving engine for SWE-bench, noted in that subsection.

Function calling — BFCL (full-N, same harness for all rows)

Protocol: mainline ggml-org/llama.cpp llama-server --jinja, server-side structured tool_calls over /v1/chat/completions, scored with bfcl-eval 2025.10.27.1 (DeepSeek-V3.2-Exp-FC OpenAI-tools handler pointed at the local server). Full-N per category: 200 / 200 / 1053 / 240. Thinking OFF, temp 0 (greedy). Composite = mean of the four categories.

Model	multi_turn_base	multi_turn_miss_param	live_multiple	irrelevance	Composite
bf16 teacher (ceiling)	0.665	0.360	0.782	0.904	0.6777
Maniac 2-bit (this repo)	0.650	0.335	0.804	0.746	0.6338
Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB)	0.660	0.340	0.783	0.896	0.6696
Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB)	0.600	0.310	0.782	0.708	0.600

Takeaways:

vs Unsloth 2-bit: ahead in every category (+5.0 points multi_turn_base, +2.5 multi_turn_miss_param, +2.2 live_multiple, +3.8 irrelevance) at near-identical size.
vs the bf16 teacher: within 0.044 composite at ~18% of the bytes; live_multiple (0.804) actually exceeds the teacher (0.782).
vs Unsloth 4-bit: behind on absolute quality (−0.036 composite), mostly on irrelevance/abstention (see Limitations). The win is quality-per-GiB, not absolute quality.
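
The composite and the per-category gaps quoted in the takeaways can be recomputed from the table. This is a minimal sketch using the rounded per-category values as printed; the published composites may have been computed from unrounded scores, so a recomputed value can differ in the fourth decimal place.

```python
from statistics import mean

CATEGORIES = ("multi_turn_base", "multi_turn_miss_param", "live_multiple", "irrelevance")

# Per-category BFCL scores as printed in the table above (already rounded).
BFCL = {
    "bf16 teacher": (0.665, 0.360, 0.782, 0.904),
    "Maniac 2-bit": (0.650, 0.335, 0.804, 0.746),
    "Unsloth UD-Q4_K_XL": (0.660, 0.340, 0.783, 0.896),
    "Unsloth UD-Q2_K_XL": (0.600, 0.310, 0.782, 0.708),
}


def composite(scores):
    """Composite = unweighted mean of the four category scores."""
    return mean(scores)


def gaps_in_points(a, b):
    """Per-category difference a - b, expressed in percentage points."""
    return [round((x - y) * 100, 1) for x, y in zip(a, b)]


for name, scores in BFCL.items():
    print(f"{name:20s} composite = {composite(scores):.4f}")

gaps = gaps_in_points(BFCL["Maniac 2-bit"], BFCL["Unsloth UD-Q2_K_XL"])
print(dict(zip(CATEGORIES, gaps)))
```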

SWE-bench Verified (coding agent, N=100)

Protocol: first-100 slice of princeton-nlp/SWE-bench_Verified (split=test, slice 0:100), mini-SWE-agent scaffold, grammar-constrained (grammar-ON) single-action output format, thinking ON, temp 0, 3072 max output tokens, 75-step limit. Score = fraction of instances resolved.

Model	Resolved (N=100)
Maniac 2-bit (this repo)	0.39 (39/100)
Unsloth UD-Q4_K_XL (4-bit)	0.30 (30/100)
Unsloth UD-Q2_K_XL (2-bit)	0.24 (24/100)

Serving-engine caveat: identical agent scaffold, grammar constraint, dataset slice, and decoding for all rows, but this repo's run was served via ik_llama while the Unsloth rows were served via mainline llama.cpp (both --jinja).

τ²-bench (multi-turn conversational tool-use, N=278)

Protocol: official τ²-bench (Sierra; v1.0.0, pinned commit 1746a25), driven through the official tau2.run.run_domain Python API — environments, tasks, and scoring are the upstream code, not a reimplementation. Full base task split across all three domains: airline (50) + telecom (114) + retail (114) = 278 tasks. Agent uses native structured tool_calls, thinking OFF, temp 0, single trial (headline = pass^1 = avg_reward); composite = unweighted macro-mean of per-domain avg_reward. The user simulator and the retail NL-assertions judge are one fixed model (gpt-5.4-mini, temp 0) held constant across every row, so only the agent model varies. That makes the rows internally comparable, but not directly comparable to the public τ²-bench leaderboard (which uses a gpt-4.1 user simulator). All four rows served on mainline llama.cpp llama-server --jinja — same engine, same harness.

Model	airline	telecom	retail	Macro avg (pass^1) ↑
Maniac 2-bit (this repo)	0.620	0.868	0.570	0.686
Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB)	0.620	0.772	0.561	0.651
Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB)	0.600	0.833	0.518	0.650
bf16 teacher (~66 GiB)	0.580	0.833	0.526	0.647

Takeaways:

Best row in the cohort on this harness — ahead of Unsloth's 2-bit (+3.6 macro points), Unsloth's 4-bit (+3.5), and even the bf16 teacher (+4.0). Single-trial (pass^1) numbers carry noise, so we'd summarize it conservatively as: the 2-bit quant shows no measurable regression on multi-turn agentic tool use.
Retail is the hardest domain for every row (it includes LLM-judged natural-language assertions); telecom (deterministic env/action scoring) is where this quant is strongest (0.868).
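
To make the single-trial noise concrete, the sketch below recomputes the macro average and attaches a rough standard error to it. It assumes each task is an independent pass/fail trial (a binomial model) and treats the two runs as unpaired, which is a simplification: both models see the same tasks, so this is an order-of-magnitude check rather than a proper significance test.

```python
from math import sqrt

# (avg_reward, task count) per domain, from the table and protocol above.
MANIAC_2BIT = {"airline": (0.620, 50), "telecom": (0.868, 114), "retail": (0.570, 114)}
BF16_TEACHER = {"airline": (0.580, 50), "telecom": (0.833, 114), "retail": (0.526, 114)}


def macro_mean(domains):
    """Unweighted mean of per-domain avg_reward, as the card defines the composite."""
    return sum(p for p, _ in domains.values()) / len(domains)


def macro_std_error(domains):
    """Binomial standard error of the macro mean (assumes independent pass/fail tasks)."""
    variance = sum(p * (1 - p) / n for p, n in domains.values())
    return sqrt(variance) / len(domains)


gap = macro_mean(MANIAC_2BIT) - macro_mean(BF16_TEACHER)
noise = sqrt(macro_std_error(MANIAC_2BIT) ** 2 + macro_std_error(BF16_TEACHER) ** 2)
print(f"macro gap = {gap:.3f}, approx. standard error of the gap = {noise:.3f}")
```

Under these assumptions the gap to the teacher comes out at roughly one standard error, which is in line with the conservative "no measurable regression" reading rather than a strict win.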

General benchmarks (this model)

Benchmark	Score	Protocol
IFEval	0.813 (prompt-strict)	lm-eval-harness, full 541, thinking ON
HumanEval	0.951 (pass@1)	greedy
MuSR	0.591	lm-eval-harness
MMLU-Redux	~0.842	subset N=741

These were measured under slightly different serving settings than the Unsloth comparison runs (protocols noted in the table), so we do not present them as a head-to-head table. The same-harness comparisons are BFCL, SWE-bench, and τ²-bench above.

What's in the file

Only mainline ggml quant types — all of this is inspectable in the GGUF metadata:

Tensor group	Format	bpw
Expert gate / up projections	IQ2_XXS	2.0625
Expert down projections	Q2_K	~2.6–3.0
Non-expert weights (attention, norms, embeddings, output)	Q2_K	~2.6–3.0

Quantized directly from the Qwen/Qwen3.6-35B-A3B bf16 release weights with imatrix calibration using llama.cpp/ik_llama tooling. Effective ~2.1 bpw, 11.8 GiB on disk.
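
One way to check the tensor formats yourself is to read the file's tensor table and group it by quant type. The sketch below assumes the `gguf` Python package (`pip install gguf`) and its `GGUFReader` class, whose tensor entries expose `tensor_type`, `n_bytes`, and `n_elements`. The helper names are made up for this example, and it has not been run against this specific file.

```python
from collections import defaultdict


def bits_per_weight_by_type(tensors):
    """Group tensors by quant type and return {type_name: bits per weight}."""
    n_bytes = defaultdict(int)
    n_weights = defaultdict(int)
    for t in tensors:
        kind = t.tensor_type.name  # e.g. "IQ2_XXS", "Q2_K", "F32"
        n_bytes[kind] += int(t.n_bytes)
        n_weights[kind] += int(t.n_elements)
    return {kind: 8 * n_bytes[kind] / n_weights[kind] for kind in n_bytes}


def inspect_file(path):
    """Read a GGUF file's tensor table and summarize it by quant type."""
    # Imported lazily so the helper above works without the gguf package.
    from gguf import GGUFReader

    return bits_per_weight_by_type(GGUFReader(path).tensors)
```

Calling `inspect_file("qwen3.6-35b-a3b-iq2xxs-q2k.gguf")` should list each quant type present alongside its bits per weight. Small unquantized tensors such as norms typically show up as F32 entries.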

Model details

Property	Value
Architecture	`qwen3next` — hybrid attention + linear-attention (Gated DeltaNet) sparse MoE
Parameters	35B total / ~3B active per token
Experts	256 routed, top-8 + 1 shared
Context	262,144 tokens native
File	`qwen3.6-35b-a3b-iq2xxs-q2k.gguf` (11.8 GiB)
License	Apache 2.0 (inherited from base model)

Limitations

It's still 2-bit. BFCL composite sits ~0.044 below the bf16 teacher and ~0.036 below a good 4-bit quant. If you have the RAM for 4-bit, use 4-bit.
Abstention is the main regression. BFCL irrelevance is 0.746 vs the teacher's 0.904: the 2-bit model is more likely to attempt a tool call when it should decline. If your agent has destructive tools, gate them.
Run tool-use with thinking OFF. With thinking ON the model may emit reasoning tokens before/around tool calls; all function-calling numbers here are thinking-OFF (--reasoning-budget 0).
qwen3next engine support is still maturing. Mainline llama.cpp (recent builds) serves this model cleanly on CUDA and Metal. On ik_llama's Metal backend, batched prefill on this architecture can be unstable — prefer single-stream (--parallel 1) or mainline llama.cpp on Macs.
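
As a sketch of what gating destructive tools could look like, the helper below filters tool calls before an agent executes them. The tool names and the `approve` callback are hypothetical; the only assumption about the model output is that tool calls follow the OpenAI chat-completions shape, with a function name and a JSON-encoded argument string.

```python
import json

# Hypothetical tool names; substitute whichever tools in your agent have side effects.
DESTRUCTIVE_TOOLS = {"delete_file", "drop_table", "send_payment"}


def gate_tool_calls(tool_calls, approve):
    """Split tool calls into (allowed, blocked).

    `approve` is a callback that receives (name, arguments) for a destructive
    call and returns True to let it through, for example after asking a human.
    """
    allowed, blocked = [], []
    for call in tool_calls:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"] or "{}")
        if name in DESTRUCTIVE_TOOLS and not approve(name, args):
            blocked.append(call)
        else:
            allowed.append(call)
    return allowed, blocked
```

Non-destructive calls pass straight through, so the extra confirmation step only costs latency where a wrong call would do damage. That is the case the lower irrelevance score makes more likely.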

Provenance & license

Base model: Qwen/Qwen3.6-35B-A3B by the Qwen team.
Quantized by: Maniac (ManiacLabs).
License: Apache 2.0, inherited from the base model.

If you use this model, please credit both Qwen (base model) and Maniac (quantization).

@misc{maniac-qwen3.6-35b-a3b-2bit,
  title        = {Maniac Qwen3.6-35B-A3B 2-bit GGUF (IQ2\_XXS + Q2\_K)},
  author       = {Maniac},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ManiacLabs/Qwen3.6-35B-A3B-2bit}},
  note         = {Imatrix-calibrated 2-bit GGUF of Qwen3.6-35B-A3B. BFCL composite 0.6338.}
}
