Qwen3.6-35B-A3B — INT4 AutoRound Quantization

4-bit quantization of Qwen/Qwen3.6-35B-A3B produced with spark-auto-round.

Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35B total parameters and ~3B active parameters per token (256 experts, 8 active per token). It uses a hybrid attention architecture (linear attention in most layers, with full attention in every 4th layer) and a 262K-token context window.

Quantization Details

| Parameter | Value |
|---|---|
| Method | AutoRound |
| AutoRound version | 0.14.1 |
| Bits | 4 (int) |
| Group size | 128 |
| Symmetric | Yes |
| Packing format | auto_round:auto_gptq |
| Calibration dataset | opencode-instruct |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
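To make the "4-bit, group size 128, symmetric" settings concrete, here is a minimal sketch of symmetric group-wise quantization using plain round-to-nearest. It is not AutoRound's algorithm, which tunes the rounding over many iterations; the function names and the choice of `max_abs / 7` as the scale are illustrative assumptions.

```python
def quantize_group_sym_int4(weights):
    """Quantize one group of floats to signed 4-bit codes with a shared scale.

    Simplified round-to-nearest sketch (assumption: scale = max|w| / 7).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Symmetric: no zero-point offset, codes are clamped to the int4 range [-8, 7].
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights from the codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128):
    """Split a weight row into groups; each group gets its own scale."""
    return [
        quantize_group_sym_int4(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]
```

A smaller group size means more scales to store but a tighter fit per group; 128 is a common trade-off.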

MLP gate layers and shared expert gate layers are kept in FP16 to preserve routing quality.
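The sketch below illustrates why routing is sensitive to small errors in the gate: a tiny change in one router logit can swap which experts are selected. It assumes a softmax router that picks the top 8 of 256 experts and renormalizes their weights; the actual routing code of the model may differ, and `route_top_k` is a hypothetical helper.

```python
import math


def route_top_k(router_logits, k=8):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalize over the top k.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


# 256 experts; experts 7 and 8 sit right at the selection boundary.
logits = [0.0] * 256
for i in range(7):
    logits[i] = 5.0 - 0.5 * i
logits[7], logits[8] = 1.000, 0.999

before = route_top_k(logits)

# A perturbation of 0.002 (e.g. from quantization error) flips the last slot.
perturbed = list(logits)
perturbed[7] -= 0.002
after = route_top_k(perturbed)
```

Here `before` selects expert 7 and `after` selects expert 8 instead, so the token is processed by a different expert even though the logit barely moved.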

Quality Report

Quantized with AutoRound's sensitivity-based optimization. All 40 transformer blocks were evaluated:

| Status | Count |
|---|---|
| Pass (cosine sim ≥ 0.99) | 27 |
| Warning (cosine sim 0.98–0.99) | 13 |

All blocks maintain a cosine similarity above 0.98 relative to the original model. Warnings are concentrated in the deeper layers (23–37), which is typical for MoE models at 4-bit.
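As a rough illustration of how such a report could be produced, the sketch below compares a block's original and quantized outputs by cosine similarity and applies the thresholds from the table. The helper names and the "fail" label for values below 0.98 are assumptions; the report above lists only pass and warning.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_block(cos_sim):
    """Map a similarity score to a status using the thresholds in the table."""
    if cos_sim >= 0.99:
        return "pass"
    if cos_sim >= 0.98:
        return "warning"
    return "fail"  # hypothetical label; no block in this report falls here


# Example: flattened outputs of one block before and after quantization.
original_output = [0.12, -0.80, 0.33, 0.95]
quantized_output = [0.10, -0.78, 0.35, 0.97]
status = classify_block(cosine_similarity(original_output, quantized_output))
```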

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Enable sampling explicitly so that temperature/top_k/top_p are applied
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_k=20, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Model Architecture

  • Architecture: Qwen3.5 MoE (hybrid linear + full attention)
  • Total parameters: ~35B
  • Active parameters: ~3B per token
  • Experts: 256 total, 8 active per token
  • Layers: 40 (three linear-attention layers followed by one full-attention layer, repeated)
  • Context length: 262,144 tokens
  • Vocabulary: 248,320 tokens
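The layer layout listed above can be sketched as a simple pattern. This assumes full attention sits on every 4th layer counting from 1 (layers 4, 8, ..., 40); the function and label names are illustrative and are not read from the model's configuration.

```python
def layer_types(num_layers=40, full_attention_interval=4):
    """Return the assumed attention type for each layer, in order."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types()
num_full = types.count("full_attention")
num_linear = types.count("linear_attention")
```

Under this assumption the 40 layers split into 30 linear-attention layers and 10 full-attention layers.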

Hardware Requirements

The quantized model requires approximately 19.5 GB of VRAM/RAM. A single 24 GB GPU (e.g., RTX 3090/4090), or two 12 GB GPUs with device_map="auto", is sufficient.
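A back-of-the-envelope check of the memory figure is sketched below. It assumes all 35B parameters are quantized, one 16-bit scale and one 4-bit zero point are stored per group of 128 weights, and 1 GB = 10^9 bytes. These are simplifying assumptions, not details confirmed by the model files.

```python
def estimate_quantized_weight_gb(num_params, bits=4, group_size=128,
                                 scale_bits=16, zero_point_bits=4):
    """Estimate packed weight size in GB (10^9 bytes) under the stated assumptions."""
    # Per-group metadata is amortized over the weights in that group.
    bits_per_weight = bits + (scale_bits + zero_point_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


estimate_gb = estimate_quantized_weight_gb(35e9)
```

Under these assumptions the packed weights come to roughly 18.2 GB. The gap to the reported figure would be consistent with layers kept in higher precision, although the source does not give that breakdown. Activations and the KV cache need additional memory at inference time.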

Quantization Command

auto-round \
  --model Qwen/Qwen3.6-35B-A3B \
  --batch_size 8 \
  --iters 1000 \
  --nsamples 512 \
  --seqlen 2048 \
  --dataset opencode-instruct \
  --output_dir ./models/Qwen3.6-35B-A3B-int4-AutoRound

Credits
