Instructions to use cyburn/Qwen3.6-35B-A3B-int4-AutoRound with libraries, inference servers, and local apps.

Libraries

How to use cyburn/Qwen3.6-35B-A3B-int4-AutoRound with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image, so the multimodal pipeline task is required
pipe = pipeline("image-text-to-text", model="cyburn/Qwen3.6-35B-A3B-int4-AutoRound")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("cyburn/Qwen3.6-35B-A3B-int4-AutoRound")
model = AutoModelForMultimodalLM.from_pretrained("cyburn/Qwen3.6-35B-A3B-int4-AutoRound")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use cyburn/Qwen3.6-35B-A3B-int4-AutoRound with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyburn/Qwen3.6-35B-A3B-int4-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
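
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on `localhost:8000` and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `chat` are hypothetical, chosen for illustration.

```python
import json
import urllib.request

MODEL_ID = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes the server is running and answers with an OpenAI-style body.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

For a server on a different port, pass `base_url` explicitly, for example `chat("Hello", base_url="http://localhost:30000")`.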

Use Docker

docker model run hf.co/cyburn/Qwen3.6-35B-A3B-int4-AutoRound

SGLang

How to use cyburn/Qwen3.6-35B-A3B-int4-AutoRound with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cyburn/Qwen3.6-35B-A3B-int4-AutoRound" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyburn/Qwen3.6-35B-A3B-int4-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cyburn/Qwen3.6-35B-A3B-int4-AutoRound" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyburn/Qwen3.6-35B-A3B-int4-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use cyburn/Qwen3.6-35B-A3B-int4-AutoRound with Docker Model Runner:
```
docker model run hf.co/cyburn/Qwen3.6-35B-A3B-int4-AutoRound
```

Qwen3.6-35B-A3B — INT4 AutoRound Quantization

4-bit quantization of Qwen/Qwen3.6-35B-A3B produced with spark-auto-round.

Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35B total parameters and ~3B active parameters per forward pass (8 of 256 experts active per token). It uses a hybrid attention architecture (linear attention in most layers, with full attention in every 4th layer) and has a 262K-token context window.

Quantization Details

| Parameter | Value |
| --- | --- |
| Method | AutoRound |
| AutoRound version | 0.14.1 |
| Bits | 4 (int) |
| Group size | 128 |
| Symmetric | Yes |
| Packing format | auto_round:auto_gptq |
| Calibration dataset | opencode-instruct |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |

MLP gate layers and shared expert gate layers are kept in FP16 to preserve routing quality.
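
To make "4-bit, group size 128, symmetric" concrete, here is a rough sketch of group-wise symmetric quantization. It uses plain round-to-nearest, which is not AutoRound's method (AutoRound tunes the rounding during calibration), and the scale convention `max_abs / 7` is an assumption made for illustration. The function names and the sample weights are hypothetical.

```python
GROUP_SIZE = 128      # from the table above
INT4_MIN, INT4_MAX = -8, 7


def quantize_group_sym_int4(weights):
    """Round-to-nearest symmetric 4-bit quantization of one weight group.

    Returns the integer codes and the single shared scale for the group.
    """
    max_abs = max(abs(w) for w in weights)
    # Assumption: map the largest magnitude to code 7; use 1.0 for an all-zero group.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Reconstruct approximate weights from codes and the group scale."""
    return [q * scale for q in codes]


# Synthetic weights in roughly [-0.1, 0.1], one full group.
group = [((i * 37) % 101 - 50) / 500 for i in range(GROUP_SIZE)]
codes, scale = quantize_group_sym_int4(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.5f}, max abs error={max_error:.5f}")
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the group scale, which is why smaller groups (and therefore tighter scales) generally cost more storage but lose less precision.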

Quality Report

Quantized with AutoRound's sensitivity-based optimization. All 40 transformer blocks were evaluated:

| Status | Count |
| --- | --- |
| Pass (cosine sim ≥ 0.99) | 27 |
| Warning (cosine sim 0.98–0.99) | 13 |

All layers maintain cosine similarity > 0.98 vs the original. Warnings are concentrated in the deeper layers (23–37), which is typical for MoE models at 4-bit.
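
As an illustration of how such a per-block check could work, the sketch below computes cosine similarity between a reference output and a quantized output and applies the thresholds from the table. It is not the tool's actual reporting code: the function names and sample vectors are hypothetical, and the "fail" label for values below 0.98 is an assumption, since no block in this report fell in that range.

```python
import math

PASS_THRESHOLD = 0.99   # from the table above
WARN_THRESHOLD = 0.98   # from the table above


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(similarity):
    """Map a similarity score to a status label."""
    if similarity >= PASS_THRESHOLD:
        return "pass"
    if similarity >= WARN_THRESHOLD:
        return "warning"
    return "fail"  # hypothetical label, not used in the report above


# Hypothetical block outputs: original model vs. quantized model.
reference_output = [1.0, 2.0, 3.0, 4.0]
quantized_output = [1.1, 1.9, 3.2, 3.9]
similarity = cosine_similarity(reference_output, quantized_output)
print(f"{similarity:.4f} -> {classify(similarity)}")
```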

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is needed for temperature/top_k/top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_k=20, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Model Architecture

Architecture: Qwen3.5 MoE (hybrid linear + full attention)
Total parameters: ~35B
Active parameters: ~3B per token
Experts: 256 total, 8 active per token
Layers: 40 (three linear-attention layers followed by one full-attention layer, repeated)
Context length: 262,144 tokens
Vocabulary: 248,320 tokens
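
The layer and expert figures above can be checked with a few lines of arithmetic. This sketch assumes the full-attention layer is the last one in each group of four (layers 4, 8, 12, ...); the constant and function names are hypothetical and do not come from the model's configuration file.

```python
NUM_LAYERS = 40
FULL_ATTENTION_INTERVAL = 4   # assumption: every 4th layer uses full attention
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 8


def layer_types(num_layers=NUM_LAYERS, interval=FULL_ATTENTION_INTERVAL):
    """Return the attention type of each layer, in order."""
    return [
        "full_attention" if (index + 1) % interval == 0 else "linear_attention"
        for index in range(num_layers)
    ]


types = layer_types()
linear_count = types.count("linear_attention")
full_count = types.count("full_attention")
active_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(types[:4])
print(f"linear: {linear_count}, full: {full_count}")
print(f"experts active per token: {active_fraction:.3%}")
```

Under these assumptions, 30 of the 40 layers use linear attention and 10 use full attention, and each token is routed to 8/256 (about 3.1%) of the experts.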

Hardware Requirements

The quantized model requires approximately 19.5 GB of VRAM/RAM. A single 24 GB GPU (e.g., RTX 3090/4090), or two 12 GB GPUs with device_map="auto", is sufficient.
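
A back-of-envelope estimate shows where most of that memory goes. This sketch treats all ~35B parameters as 4-bit and assumes each group of 128 weights stores one FP16 scale and one packed 4-bit zero point; both storage details are assumptions about the packing format, not figures from this model card. It ignores activations, the KV cache, and tensors kept in higher precision.

```python
TOTAL_PARAMS = 35e9       # approximate, from the architecture summary
BITS_PER_WEIGHT = 4
GROUP_SIZE = 128
SCALE_BITS = 16           # assumption: one FP16 scale per group
ZERO_POINT_BITS = 4       # assumption: one packed 4-bit zero point per group


def quantized_weight_gb(
    params=TOTAL_PARAMS,
    bits=BITS_PER_WEIGHT,
    group_size=GROUP_SIZE,
    scale_bits=SCALE_BITS,
    zero_point_bits=ZERO_POINT_BITS,
):
    """Estimate quantized weight storage in decimal gigabytes."""
    effective_bits = bits + (scale_bits + zero_point_bits) / group_size
    return params * effective_bits / 8 / 1e9


estimate_gb = quantized_weight_gb()
print(f"estimated weight storage: {estimate_gb:.2f} GB")
```

The estimate lands a little over 18 GB. The remaining gap to the reported figure may come from layers left unquantized, such as the FP16 gate layers mentioned above, though this sketch does not account for them.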

Quantization Command

auto-round \
  --model Qwen/Qwen3.6-35B-A3B \
  --batch_size 8 \
  --iters 1000 \
  --nsamples 512 \
  --seqlen 2048 \
  --dataset opencode-instruct \
  --output_dir ./models/Qwen3.6-35B-A3B-int4-AutoRound

Credits

Base model: Qwen/Qwen3.6-35B-A3B by Alibaba Cloud
Quantization tool: spark-auto-round — a GB10-optimized fork of Intel's auto-round, tuned for DGX Spark / GB10 unified memory hardware
