Qwen3.6-35B-A3B 2-bit GGUF — teacher-level agentic tool use at ~12 GiB
The strongest 2-bit quant of Qwen/Qwen3.6-35B-A3B we know of for agentic work — and not just on one benchmark. On τ²-bench (multi-turn agentic tool use) it matches or beats every model in our cohort, including the bf16 teacher itself, and it beats Unsloth's same-size 2-bit across all three agentic benchmarks we measured: BFCL, τ²-bench, and SWE-bench Verified. One standard GGUF file, mainline-llama.cpp native — runs on a 16 GB Mac or a 12 GB GPU.
| Quant | File size | BFCL composite ↑ | τ²-bench macro ↑ | SWE-bench Verified ↑ |
|---|---|---|---|---|
| Maniac 2-bit (this repo, IQ2_XXS+Q2_K) | 11.8 GiB | 0.6338 | 0.686 | 0.39 |
| Unsloth UD-Q2_K_XL (2-bit) | 11.45 GiB | 0.600 | 0.650 | 0.24 |
| Unsloth UD-Q4_K_XL (4-bit) | 20.82 GiB | 0.6696 | 0.651 | 0.30 |
| bf16 teacher (reference ceiling) | ~66 GiB | 0.6777 | 0.647 | — |
Every column is measured by us with the identical harness and settings for all rows in that column (full protocols below): BFCL via mainline llama.cpp + bfcl-eval, full-N, thinking OFF, temp 0; τ²-bench via the official Sierra harness, all 278 base tasks, fixed user-simulator model; SWE-bench Verified via mini-SWE-agent on the first-100 slice (our row served via ik_llama, Unsloth rows via mainline llama.cpp — same scaffold and decoding). Honest caveats: τ² is single-trial (pass^1), so we read it conservatively as "no measurable regression vs the teacher on multi-turn tool use" rather than a strict win; and on absolute BFCL, Unsloth's 4-bit is still better (0.6696 vs 0.6338) at 1.8× the size — this repo wins on quality-per-gigabyte and on fitting in 12–16 GB.
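The quality-per-gigabyte comparison can be reproduced from the summary table. The sketch below is only illustrative: dividing the BFCL composite by file size is one assumed way to express "quality per GiB", and the helper names are hypothetical.

```python
# Scores and file sizes copied from the summary table above.
QUANTS = {
    "Maniac 2-bit": {"size_gib": 11.8, "bfcl": 0.6338},
    "Unsloth UD-Q2_K_XL": {"size_gib": 11.45, "bfcl": 0.600},
    "Unsloth UD-Q4_K_XL": {"size_gib": 20.82, "bfcl": 0.6696},
}


def bfcl_per_gib(name: str) -> float:
    """BFCL composite divided by file size in GiB (an illustrative metric)."""
    row = QUANTS[name]
    return row["bfcl"] / row["size_gib"]


def size_ratio(larger: str, smaller: str) -> float:
    """How many times bigger one file is than another."""
    return QUANTS[larger]["size_gib"] / QUANTS[smaller]["size_gib"]


if __name__ == "__main__":
    for name in QUANTS:
        print(f"{name}: {bfcl_per_gib(name):.4f} BFCL composite per GiB")
    ratio = size_ratio("Unsloth UD-Q4_K_XL", "Maniac 2-bit")
    print(f"4-bit file is {ratio:.2f}x the size of this repo's file")
```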
Maniac engine format: Qwen3.6-35B-A3B-2bit-maniac — the same quant repackaged for on-device serving in the Maniac app.
Download
One file, no splits, no sidecars:
| Filename | Quant type | File size | Description |
|---|---|---|---|
| qwen3.6-35b-a3b-iq2xxs-q2k.gguf | IQ2_XXS / Q2_K mix (~2.1 bpw) | 11.8 GiB | Imatrix-calibrated 2-bit, tuned for tool-calling / agentic use. Fully resident in 12 GiB VRAM or a 16 GB Mac. Recommended. |
pip install -U "huggingface_hub[cli]"
huggingface-cli download ManiacLabs/Qwen3.6-35B-A3B-2bit qwen3.6-35b-a3b-iq2xxs-q2k.gguf --local-dir ./
Which file should I choose? There is only one, and that's the point. If you have 12–16 GB of RAM/VRAM and want the best agentic 2-bit Qwen3.6-35B, this is it. If you have 24 GB+, a good 4-bit quant (e.g. Unsloth UD-Q4_K_XL) scores higher on absolute BFCL quality and is worth the extra 9 GiB.
Quickstart
This is a standard GGUF using only mainline ggml quant types (IQ2_XXS, Q2_K) — it loads in stock llama.cpp, ik_llama, LM Studio, and any other llama.cpp-based app. No custom build needed.
llama.cpp (OpenAI-compatible server)
# Tool-use / function-calling (recommended: thinking OFF)
llama-server \
--model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
--jinja \
--reasoning-budget 0 \
-fa on \
-ngl 99
# One-shot CLI
llama-cli \
--model qwen3.6-35b-a3b-iq2xxs-q2k.gguf \
--jinja --reasoning-budget 0 \
-p "Write a Python function to compute Fibonacci numbers."
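Once llama-server is running, a tool-calling request can be sent to its OpenAI-compatible endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server listens on http://localhost:8080 (adjust if you pass --port), the get_weather tool is hypothetical, and the model field is a placeholder label.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: llama-server default address


def build_tool_request(user_message: str) -> dict:
    """Build a chat-completions payload with one hypothetical tool."""
    return {
        "model": "qwen3.6-35b-a3b-iq2xxs-q2k.gguf",  # placeholder label
        "temperature": 0,  # greedy, as in the benchmark setup
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def chat(payload: dict, base_url: str = BASE_URL) -> dict:
    """POST the payload to /chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from the first choice, if any."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


if __name__ == "__main__":
    reply = chat(build_tool_request("What is the weather in Paris?"))
    print(extract_tool_calls(reply))
```

If the model answers in plain text instead of calling the tool, extract_tool_calls returns an empty list and the text is in the message's content field.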
Thinking mode
Qwen3.6 is a hybrid thinking model and defaults to thinking ON:
- Tool-use / agents: run thinking OFF (--reasoning-budget 0, or enable_thinking=false in the chat template). All BFCL numbers on this card are measured thinking-OFF.
- Hard reasoning tasks: leave thinking ON (omit --reasoning-budget 0).
Recommended sampling
- Tool-use / agents (matches our benchmark setup): temperature=0 (greedy)
- General chat, non-thinking: temperature=0.7, top_p=0.8, top_k=20 (upstream Qwen guidance)
- Thinking mode: temperature=1.0, top_p=0.95, top_k=20 (upstream Qwen guidance)
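The three presets above can be kept in one place and merged into each request body. This is a sketch under stated assumptions: the mode names and request_body helper are hypothetical, top_k is passed as a server-side extension to the OpenAI schema, and the chat_template_kwargs field for toggling enable_thinking per request is assumed to be supported by your llama-server build. If it is not, use --reasoning-budget 0 at startup instead.

```python
# Sampling settings copied from the list above.
SAMPLING_PRESETS = {
    "tool_use": {"temperature": 0.0},
    "chat": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20},
}


def request_body(mode: str, messages: list) -> dict:
    """Merge the preset for `mode` into a chat-completions request body."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    body = {"messages": messages, **SAMPLING_PRESETS[mode]}
    # Assumption: the server forwards these kwargs to the chat template.
    body["chat_template_kwargs"] = {"enable_thinking": mode == "thinking"}
    return body


if __name__ == "__main__":
    example = request_body("chat", [{"role": "user", "content": "Hello"}])
    print(example)
```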
Maniac desktop app
This model is also available on-device via the Maniac app catalog — select Qwen3.6-35B-A3B 2-bit in the local models pane and the app handles download and serving. The app runs a separate format optimized for its own engine (Qwen3.6-35B-A3B-2bit-maniac), not this GGUF file.
Will it run on my machine?
Almost certainly, and not just on Macs: this is a standard GGUF that runs via llama.cpp/ik_llama on any Apple Silicon Mac with 16 GB+ unified memory (M1 and later), on Windows/Linux CPUs with ~13 GB of free RAM, and on any CUDA GPU with 13 GB+ VRAM — plus partial-offload setups below that.
| Hardware | Fits? | Notes |
|---|---|---|
| 16 GB Apple Silicon Mac (any M1 or later) | ✅ fully resident | The headline use case — a 35B-class agentic model on a 16 GB laptop |
| 12 GB GPU (RTX 3060 12GB and up) | ✅ fully resident | -ngl 99, full GPU offload |
| 8 GB GPU / RAM | ⚠️ partial offload | Works with CPU offload, slower |
| 24 GB+ GPU / 32 GB+ Mac | ✅ | Consider a 4-bit quant instead for max quality |
| Disk | 11.8 GiB | Single file |
Decode is fast for the size class: Qwen3.6-35B-A3B is a sparse MoE with only ~3B active parameters per token, so token generation costs roughly what a 3B dense model does at equal bandwidth.
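The hardware table mixes GiB (file size) with the decimal GB used in hardware marketing, so a quick unit check can help. A small sketch of that arithmetic follows. The headroom_gib parameter is a hypothetical allowance for KV cache and compute buffers, whose real size depends on context length and backend, so treat the fit check as a lower bound rather than a guarantee.

```python
GIB = 2**30  # bytes in one gibibyte
GB = 10**9   # bytes in one decimal gigabyte


def gib_to_gb(gib: float) -> float:
    """Convert gibibytes to decimal gigabytes."""
    return gib * GIB / GB


def fits(model_gib: float, memory_gib: float, headroom_gib: float = 0.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given memory."""
    return model_gib + headroom_gib <= memory_gib


if __name__ == "__main__":
    print(f"11.8 GiB is about {gib_to_gb(11.8):.2f} GB")
    for memory in (8.0, 12.0, 16.0):
        print(f"{memory:.0f} GiB, weights only: {fits(11.8, memory)}")
```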
Benchmarks
Everything in this section was measured by us, with the protocol stated inline. The headline comparison (vs Unsloth) uses the same harness and settings for every model in each table; the one exception is the SWE-bench serving engine, noted in that section.
Function calling — BFCL (full-N, same harness for all rows)
Protocol: mainline ggml-org/llama.cpp llama-server --jinja, server-side structured tool_calls over /v1/chat/completions, scored with bfcl-eval 2025.10.27.1 (DeepSeek-V3.2-Exp-FC OpenAI-tools handler pointed at the local server). Full-N per category: 200 / 200 / 1053 / 240. Thinking OFF, temp 0 (greedy). Composite = mean of the four categories.
| Model | multi_turn_base | multi_turn_miss_param | live_multiple | irrelevance | Composite |
|---|---|---|---|---|---|
| bf16 teacher (ceiling) | 0.665 | 0.360 | 0.782 | 0.904 | 0.6777 |
| Maniac 2-bit (this repo) | 0.650 | 0.335 | 0.804 | 0.746 | 0.6338 |
| Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB) | 0.660 | 0.340 | 0.783 | 0.896 | 0.6696 |
| Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB) | 0.600 | 0.310 | 0.782 | 0.708 | 0.600 |
Takeaways:
- vs Unsloth 2-bit: ahead in every category (+5.0 points on multi_turn_base, +2.5 on multi_turn_miss_param, +2.2 on live_multiple, +3.8 on irrelevance) at near-identical size.
- vs the bf16 teacher: within 0.044 composite at ~18% of the bytes; live_multiple (0.804) actually exceeds the teacher (0.782).
- vs Unsloth 4-bit: behind on absolute quality (−0.036 composite), mostly on irrelevance/abstention; see Limitations. The win is quality-per-GiB, not absolute quality.
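The composite and the per-category gaps can be recomputed from the table. The sketch below copies the rounded table values, so results may differ from the published composites in the last digit. The function names are hypothetical.

```python
# Per-category BFCL scores copied from the table above, in the order:
# multi_turn_base, multi_turn_miss_param, live_multiple, irrelevance.
BFCL = {
    "bf16 teacher": [0.665, 0.360, 0.782, 0.904],
    "Maniac 2-bit": [0.650, 0.335, 0.804, 0.746],
    "Unsloth UD-Q4_K_XL": [0.660, 0.340, 0.783, 0.896],
    "Unsloth UD-Q2_K_XL": [0.600, 0.310, 0.782, 0.708],
}


def composite(scores: list) -> float:
    """Unweighted mean of the four category scores."""
    return sum(scores) / len(scores)


def deltas_in_points(model: str, baseline: str) -> list:
    """Per-category difference (model minus baseline) in percentage points."""
    return [
        round((m - b) * 100, 1)
        for m, b in zip(BFCL[model], BFCL[baseline])
    ]


if __name__ == "__main__":
    for name, scores in BFCL.items():
        print(f"{name}: composite {composite(scores):.4f}")
    print(deltas_in_points("Maniac 2-bit", "Unsloth UD-Q2_K_XL"))
```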
SWE-bench Verified (coding agent, N=100)
Protocol: first-100 slice of princeton-nlp/SWE-bench_Verified (split=test, slice 0:100), mini-SWE-agent scaffold, grammar-constrained (grammar-ON) single-action output format, thinking ON, temp 0, 3072 max output tokens, 75-step limit. Score = fraction of instances resolved.
| Model | Resolved (N=100) |
|---|---|
| Maniac 2-bit (this repo) | 0.39 (39/100) |
| Unsloth UD-Q4_K_XL (4-bit) | 0.30 (30/100) |
| Unsloth UD-Q2_K_XL (2-bit) | 0.24 (24/100) |
Serving-engine caveat: identical agent scaffold, grammar constraint, dataset slice, and decoding for all rows, but this repo's run was served via ik_llama while the Unsloth rows were served via mainline llama.cpp (both --jinja).
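With N=100, each resolved rate carries sampling uncertainty. One rough way to gauge it is a 95% Wilson score interval per row. This sketch assumes each instance is an independent trial and is not part of the protocol above. The wilson_interval helper is hypothetical.

```python
import math

# Resolved counts copied from the table above (out of 100 instances).
RESOLVED = {
    "Maniac 2-bit": 39,
    "Unsloth UD-Q4_K_XL": 30,
    "Unsloth UD-Q2_K_XL": 24,
}


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for name, resolved in RESOLVED.items():
        low, high = wilson_interval(resolved, 100)
        print(f"{name}: {resolved}/100, interval [{low:.3f}, {high:.3f}]")
```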
τ²-bench (multi-turn conversational tool-use, N=278)
Protocol: official τ²-bench (Sierra; v1.0.0, pinned commit 1746a25), driven through the official tau2.run.run_domain Python API — environments, tasks, and scoring are the upstream code, not a reimplementation. Full base task split across all three domains: airline (50) + telecom (114) + retail (114) = 278 tasks. Agent uses native structured tool_calls, thinking OFF, temp 0, single trial (headline = pass^1 = avg_reward); composite = unweighted macro-mean of per-domain avg_reward. The user simulator and the retail NL-assertions judge are one fixed model (gpt-5.4-mini, temp 0) held constant across every row, so only the agent model varies. That makes the rows internally comparable, but not directly comparable to the public τ²-bench leaderboard (which uses a gpt-4.1 user simulator). All four rows served on mainline llama.cpp llama-server --jinja — same engine, same harness.
| Model | airline | telecom | retail | Macro avg (pass^1) ↑ |
|---|---|---|---|---|
| Maniac 2-bit (this repo) | 0.620 | 0.868 | 0.570 | 0.686 |
| Unsloth UD-Q4_K_XL (4-bit, 20.82 GiB) | 0.620 | 0.772 | 0.561 | 0.651 |
| Unsloth UD-Q2_K_XL (2-bit, 11.45 GiB) | 0.600 | 0.833 | 0.518 | 0.650 |
| bf16 teacher (~66 GiB) | 0.580 | 0.833 | 0.526 | 0.647 |
Takeaways:
- Best row in the cohort on this harness — ahead of Unsloth's 2-bit (+3.6 macro points), Unsloth's 4-bit (+3.5), and even the bf16 teacher (+4.0). Single-trial (pass^1) numbers carry noise, so we'd summarize it conservatively as: the 2-bit quant shows no measurable regression on multi-turn agentic tool use.
- Retail is the hardest domain for every row (it includes LLM-judged natural-language assertions); telecom (deterministic env/action scoring) is where this quant is strongest (0.868).
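The macro average weights each domain equally regardless of task count. As an illustration, the sketch below recomputes it from the table and contrasts it with a task-weighted mean, which is not the reported metric. It uses the rounded per-domain values, so a recomputed macro average can differ from the table in the last digit.

```python
# Task counts and per-domain pass^1 scores copied from the protocol and table above.
TASKS = {"airline": 50, "telecom": 114, "retail": 114}
TAU2 = {
    "Maniac 2-bit": {"airline": 0.620, "telecom": 0.868, "retail": 0.570},
    "Unsloth UD-Q4_K_XL": {"airline": 0.620, "telecom": 0.772, "retail": 0.561},
    "Unsloth UD-Q2_K_XL": {"airline": 0.600, "telecom": 0.833, "retail": 0.518},
    "bf16 teacher": {"airline": 0.580, "telecom": 0.833, "retail": 0.526},
}


def macro_mean(scores: dict) -> float:
    """Unweighted mean over domains (the reported composite)."""
    return sum(scores.values()) / len(scores)


def task_weighted_mean(scores: dict) -> float:
    """Mean weighted by task count per domain (shown for contrast only)."""
    total = sum(TASKS.values())
    return sum(scores[domain] * TASKS[domain] for domain in TASKS) / total


if __name__ == "__main__":
    for name, scores in TAU2.items():
        print(
            f"{name}: macro {macro_mean(scores):.3f}, "
            f"task-weighted {task_weighted_mean(scores):.3f}"
        )
```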
General benchmarks (this model)
| Benchmark | Score | Protocol |
|---|---|---|
| IFEval | 0.813 (prompt-strict) | lm-eval-harness, full 541, thinking ON |
| HumanEval | 0.951 (pass@1) | greedy |
| MuSR | 0.591 | lm-eval-harness |
| MMLU-Redux | ~0.842 | subset N=741 |
These were measured under slightly different serving settings than the Unsloth comparison runs (protocols noted in the table), so we don't present them as a head-to-head table. The same-harness comparisons above are BFCL, SWE-bench, and τ²-bench.
What's in the file
Only mainline ggml quant types — all of this is inspectable in the GGUF metadata:
| Tensor group | Format | bpw |
|---|---|---|
| Expert gate / up projections | IQ2_XXS | 2.0625 |
| Expert down projections | Q2_K | ~2.6–3.0 |
| Non-expert weights (attention, norms, embeddings, output) | Q2_K | ~2.6–3.0 |
Quantized directly from the Qwen/Qwen3.6-35B-A3B bf16 release weights with imatrix calibration using llama.cpp/ik_llama tooling. Effective ~2.1 bpw, 11.8 GiB on disk.
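As a first sanity check on a download, the fixed-size GGUF header can be read with the standard library alone. This sketch assumes the little-endian header layout of GGUF version 2 and later (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). It does not parse per-tensor quant types; for that, a full reader such as the gguf Python package from the llama.cpp repository is the more practical route.

```python
import struct

# Little-endian: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata key/value count (assumed GGUF v2+ layout).
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path: str) -> dict:
    """Read and validate the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as handle:
        raw = handle.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file is too short to be a GGUF file")
    magic, version, tensor_count, kv_count = HEADER.unpack(raw)
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(read_gguf_header(sys.argv[1]))
    else:
        print("usage: python read_header.py qwen3.6-35b-a3b-iq2xxs-q2k.gguf")
```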
Model details
| Property | Value |
|---|---|
| Architecture | qwen3next — hybrid attention + linear-attention (Gated DeltaNet) sparse MoE |
| Parameters | 35B total / ~3B active per token |
| Experts | 256 routed, top-8 + 1 shared |
| Context | 262,144 tokens native |
| File | qwen3.6-35b-a3b-iq2xxs-q2k.gguf (11.8 GiB) |
| License | Apache 2.0 (inherited from base model) |
Limitations
- It's still 2-bit. BFCL composite sits ~0.044 below the bf16 teacher and ~0.036 below a good 4-bit quant. If you have the RAM for 4-bit, use 4-bit.
- Abstention is the main regression. BFCL irrelevance is 0.746 vs the teacher's 0.904: the 2-bit model is more likely to attempt a tool call when it should decline. If your agent has destructive tools, gate them.
- Run tool-use with thinking OFF. With thinking ON the model may emit reasoning tokens before/around tool calls; all function-calling numbers here are thinking-OFF (--reasoning-budget 0).
- qwen3next engine support is still maturing. Mainline llama.cpp (recent builds) serves this model cleanly on CUDA and Metal. On ik_llama's Metal backend, batched prefill on this architecture can be unstable, so prefer single-stream (--parallel 1) or mainline llama.cpp on Macs.
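Gating destructive tools, as suggested in the abstention point above, can be as simple as an approval check before execution. The sketch below is one possible shape, not a prescribed design: the tool names, the gate_tool_call helper, and the approve callback are all hypothetical, and the tool call is assumed to follow the OpenAI-style tool_calls structure.

```python
import json

# Hypothetical names of tools that should never run without approval.
DESTRUCTIVE_TOOLS = {"delete_file", "drop_table"}


def gate_tool_call(tool_call: dict, approve) -> bool:
    """Return True if the call may run.

    Non-destructive tools pass straight through; destructive ones run only
    if the `approve(name, arguments)` callback returns a truthy value.
    """
    name = tool_call["function"]["name"]
    if name not in DESTRUCTIVE_TOOLS:
        return True
    arguments = json.loads(tool_call["function"].get("arguments") or "{}")
    return bool(approve(name, arguments))


if __name__ == "__main__":
    call = {"function": {"name": "delete_file", "arguments": '{"path": "notes.txt"}'}}
    # Deny by default; a real agent might prompt the user here instead.
    print(gate_tool_call(call, approve=lambda name, args: False))
```

Denying by default keeps a spurious tool call from causing damage when the model should have declined.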
Provenance & license
- Base model: Qwen/Qwen3.6-35B-A3B by the Qwen team.
- Quantized by: Maniac (ManiacLabs).
- License: Apache 2.0, inherited from the base model.
If you use this model, please credit both Qwen (base model) and Maniac (quantization).
@misc{maniac-qwen3.6-35b-a3b-2bit,
title = {Maniac Qwen3.6-35B-A3B 2-bit GGUF (IQ2\_XXS + Q2\_K)},
author = {Maniac},
year = {2026},
howpublished = {\url{https://huggingface.co/ManiacLabs/Qwen3.6-35B-A3B-2bit}},
note = {Imatrix-calibrated 2-bit GGUF of Qwen3.6-35B-A3B. BFCL composite 0.6338.}
}