Qwen3.6-35B-A3B 2-bit GGUF — teacher-level agentic tool use at ~12 GiB

The strongest 2-bit quant of Qwen/Qwen3.6-35B-A3B we know of for agentic work — and not just on one benchmark. On τ²-bench (multi-turn agentic tool use) it matches or beats every model in our cohort, including the bf16 teacher itself, and it beats Unsloth's same-size 2-bit across all three agentic benchmarks we measured: BFCL, τ²-bench, and SWE-bench Verified. One standard GGUF file, mainline-llama.cpp native — runs on a 16 GB Mac or a 12 GB GPU.

Quant File size BFCL composite ↑ τ²-bench macro ↑ SWE-bench Verified ↑
Maniac 2-bit (this repo, IQ2_XXS+Q2_K) 11.8 GiB 0.6338 0.686 0.39
Unsloth UD-Q2_K_XL (2-bit) 11.45 GiB 0.600 0.650 0.24
Unsloth UD-Q4_K_XL (4-bit) 20.82 GiB 0.6696 0.651 0.30
bf16 teacher (reference ceiling) ~66 GiB 0.6777 0.647

Every column is measured by us with the identical harness and settings for all rows in that column (full protocols below): BFCL via mainline llama.cpp + bfcl-eval, full-N, thinking OFF, temp 0; τ²-bench via the official Sierra harness, all 278 base tasks, fixed user-simulator model; SWE-bench Verified via mini-SWE-agent on the first-100 slice (our row served via ik_llama, Unsloth rows via mainline llama.cpp — same scaffold and decoding). Honest caveats: τ² is single-trial (pass^1), so we read it conservatively as "no measurable regression vs the teacher on multi-turn tool use" rather than a strict win; and on absolute BFCL, Unsloth's 4-bit is still better (0.6696 vs 0.6338) at 1.8× the size — this repo wins on quality-per-gigabyte and on fitting in 12–16 GB.

Maniac engine format: Qwen3.6-35B-A3B-2bit-maniac — the same quant repackaged for on-device serving in the Maniac app.


Download

One file, no splits, no sidecars:

Filename Quant type File size Description
qwen3.6-35b-a3b-iq2xxs-q2k.gguf IQ2_XXS / Q2_K mix (~2.1 bpw) 11.8 GiB Imatrix-calibrated 2-bit, tuned for tool-calling / agentic use. Fully resident in 12 GiB VRAM or a 16 GB Mac. Recommended.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ManiacLabs/Qwen3.6-35B-A3B-2bit qwen3.6-35b-a3b-iq2xxs-q2k.gguf --local-dir ./

Which file should I choose? There is only one — that's the point. If you have 12–16 GB of RAM/VRAM and want the best agentic 2-bit Qwen3.6-35B, this is it. If you have 24 GB+, a good 4-bit quant (e.g. Unsloth UD-Q4_K_XL) is absolutely stronger and worth the extra 9 GiB.


Quickstart

This is a standard GGUF using only mainline ggml quant types (IQ2_XXS, Q2_K) — it loads in stock llama.cpp, ik_llama, LM Studio, and any other llama.cpp-based app. No custom build needed.

llama.cpp (OpenAI-compatible server)

# Tool-use / function-calling (recommended: thinking OFF)
llama-server \
  --model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
  --jinja \
  --reasoning-budget 0 \
  -fa on \
  -ngl 99
# One-shot CLI
llama-cli \
  --model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
  --jinja --reasoning-budget 0 \
  -p "Write a Python function to compute Fibonacci numbers."

Thinking mode

Qwen3.6 is a hybrid thinking model and defaults to thinking ON:

  • Tool-use / agents: run thinking OFF (--reasoning-budget 0, or enable_thinking=false in the chat template). All BFCL numbers on this card are measured thinking-OFF.
  • Hard reasoning tasks: leave thinking ON (omit --reasoning-budget 0).

Recommended sampling

  • Tool-use / agents (matches our benchmark setup): temperature=0 (greedy)
  • General chat, non-thinking: temperature=0.7, top_p=0.8, top_k=20 (upstream Qwen guidance)
  • Thinking mode: temperature=1.0, top_p=0.95, top_k=20 (upstream Qwen guidance)

Maniac desktop app

This model is also available on-device via the Maniac app catalog — select Qwen3.6-35B-A3B 2-bit in the local models pane and the app handles download and serving. The app runs a separate format optimized for its own engine (Qwen3.6-35B-A3B-2bit-maniac), not this GGUF file.


Will it run on my machine?

Almost certainly, and not just on Macs: this is a standard GGUF that runs via llama.cpp/ik_llama on any Apple Silicon Mac with 16 GB+ unified memory (M1 and later), on Windows/Linux CPUs with ~13 GB of free RAM, and on any CUDA GPU with 13 GB+ VRAM — plus partial-offload setups below that.

Hardware Fits? Notes
16 GB Apple Silicon Mac (any M1 or later) ✅ fully resident The headline use case — a 35B-class agentic model on a 16 GB laptop
12 GB GPU (RTX 3060 12GB and up) ✅ fully resident -ngl 99, full GPU offload
8 GB GPU / RAM ⚠️ partial offload Works with CPU offload, slower
24 GB+ GPU / 32 GB+ Mac Consider a 4-bit quant instead for max quality
Disk 11.8 GiB Single file

Decode is fast for the size class: Qwen3.6-35B-A3B is a sparse MoE with only ~3B active parameters per token, so token generation costs roughly what a 3B dense model does at equal bandwidth.


Benchmarks

Everything in this section was measured by us, with the protocol stated inline. The headline claim (vs Unsloth) uses byte-identical harnesses and settings for every model in the table.

Function calling — BFCL (full-N, same harness for all rows)

Protocol: mainline ggml-org/llama.cpp llama-server --jinja, server-side structured tool_calls over /v1/chat/completions, scored with bfcl-eval 2025.10.27.1 (DeepSeek-V3.2-Exp-FC OpenAI-tools handler pointed at the local server). Full-N per category: 200 / 200 / 1053 / 240. Thinking OFF, temp 0 (greedy). Composite = mean of the four categories.

Model multi_turn_base multi_turn_miss_param live_multiple irrelevance Composite
bf16 teacher (ceiling) 0.665 0.360 0.782 0.904 0.6777
Maniac 2-bit (this repo) 0.650 0.335 0.804 0.746 0.6338
Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB) 0.660 0.340 0.783 0.896 0.6696
Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB) 0.600 0.310 0.782 0.708 0.600

Takeaways:

  • vs Unsloth 2-bit: ahead in every category (+5.0 mtb, +2.5 miss_param, +2.2 live_multiple, +3.8 irrelevance) at near-identical size.
  • vs the bf16 teacher: within 0.044 composite at ~18% of the bytes; live_multiple (0.804) actually exceeds the teacher (0.782).
  • vs Unsloth 4-bit: behind on absolute quality (−0.036), mostly on irrelevance/abstention — see Limitations. The win is quality-per-GiB, not absolute quality.

SWE-bench Verified (coding agent, N=100)

Protocol: first-100 slice of princeton-nlp/SWE-bench_Verified (split=test, slice 0:100), mini-SWE-agent scaffold, grammar-constrained (grammar-ON) single-action output format, thinking ON, temp 0, 3072 max output tokens, 75-step limit. Score = fraction of instances resolved.

Model Resolved (N=100)
Maniac 2-bit (this repo) 0.39 (39/100)
Unsloth UD-Q4_K_XL (4-bit) 0.30 (30/100)
Unsloth UD-Q2_K_XL (2-bit) 0.24 (24/100)

Serving-engine caveat: identical agent scaffold, grammar constraint, dataset slice, and decoding for all rows, but this repo's run was served via ik_llama while the Unsloth rows were served via mainline llama.cpp (both --jinja).

τ²-bench (multi-turn conversational tool-use, N=278)

Protocol: official τ²-bench (Sierra; v1.0.0, pinned commit 1746a25), driven through the official tau2.run.run_domain Python API — environments, tasks, and scoring are the upstream code, not a reimplementation. Full base task split across all three domains: airline (50) + telecom (114) + retail (114) = 278 tasks. Agent uses native structured tool_calls, thinking OFF, temp 0, single trial (headline = pass^1 = avg_reward); composite = unweighted macro-mean of per-domain avg_reward. The user simulator and the retail NL-assertions judge are one fixed model (gpt-5.4-mini, temp 0) held constant across every row, so only the agent model varies. That makes the rows internally comparable, but not directly comparable to the public τ²-bench leaderboard (which uses a gpt-4.1 user simulator). All four rows served on mainline llama.cpp llama-server --jinja — same engine, same harness.

Model airline telecom retail Macro avg (pass^1) ↑
Maniac 2-bit (this repo) 0.620 0.868 0.570 0.686
Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB) 0.620 0.772 0.561 0.651
Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB) 0.600 0.833 0.518 0.650
bf16 teacher (~66 GiB) 0.580 0.833 0.526 0.647

Takeaways:

  • Best row in the cohort on this harness — ahead of Unsloth's 2-bit (+3.6 macro points), Unsloth's 4-bit (+3.5), and even the bf16 teacher (+4.0). Single-trial (pass^1) numbers carry noise, so we'd summarize it conservatively as: the 2-bit quant shows no measurable regression on multi-turn agentic tool use.
  • Retail is the hardest domain for every row (it includes LLM-judged natural-language assertions); telecom (deterministic env/action scoring) is where this quant is strongest (0.868).

General benchmarks (this model)

Benchmark Score Protocol
IFEval 0.813 (prompt-strict) lm-eval-harness, full 541, thinking ON
HumanEval 0.951 (pass@1) greedy
MuSR 0.591 lm-eval-harness
MMLU-Redux ~0.842 subset N=741

These are measured under slightly different serve settings than the Unsloth lane (noted protocols), so we don't present them as a head-to-head table — the same-harness comparisons above are BFCL, SWE-bench, and τ²-bench.


What's in the file

Only mainline ggml quant types — all of this is inspectable in the GGUF metadata:

Tensor group Format bpw
Expert gate / up projections IQ2_XXS 2.0625
Expert down projections Q2_K ~2.6–3.0
Non-expert weights (attention, norms, embeddings, output) Q2_K ~2.6–3.0

Quantized directly from the Qwen/Qwen3.6-35B-A3B bf16 release weights with imatrix calibration using llama.cpp/ik_llama tooling. Effective ~2.1 bpw, 11.8 GiB on disk.


Model details

Property Value
Architecture qwen3next — hybrid attention + linear-attention (Gated DeltaNet) sparse MoE
Parameters 35B total / ~3B active per token
Experts 256 routed, top-8 + 1 shared
Context 262,144 tokens native
File qwen3.6-35b-a3b-iq2xxs-q2k.gguf (11.8 GiB)
License Apache 2.0 (inherited from base model)

Limitations

  • It's still 2-bit. Composite agentic quality sits ~0.044 below the bf16 teacher and ~0.036 below a good 4-bit quant. If you have the RAM for 4-bit, use 4-bit.
  • Abstention is the main regression. BFCL irrelevance is 0.746 vs the teacher's 0.904: the 2-bit model is more likely to attempt a tool call when it should decline. If your agent has destructive tools, gate them.
  • Run tool-use with thinking OFF. With thinking ON the model may emit reasoning tokens before/around tool calls; all function-calling numbers here are thinking-OFF (--reasoning-budget 0).
  • qwen3next engine support is still maturing. Mainline llama.cpp (recent builds) serves this model cleanly on CUDA and Metal. On ik_llama's Metal backend, batched prefill on this architecture can be unstable — prefer single-stream (--parallel 1) or mainline llama.cpp on Macs.

Provenance & license

If you use this model, please credit both Qwen (base model) and Maniac (quantization).

@misc{maniac-qwen3.6-35b-a3b-2bit,
  title        = {Maniac Qwen3.6-35B-A3B 2-bit GGUF (IQ2\_XXS + Q2\_K)},
  author       = {Maniac},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ManiacLabs/Qwen3.6-35B-A3B-2bit}},
  note         = {Imatrix-calibrated 2-bit GGUF of Qwen3.6-35B-A3B. BFCL composite 0.6338.}
}
Downloads last month
323
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManiacLabs/Qwen3.6-35B-A3B-2bit

Quantized
(473)
this model