How to use natfii/Qwen3.6-27B-VLM-Cascade with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use natfii/Qwen3.6-27B-VLM-Cascade with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="natfii/Qwen3.6-27B-VLM-Cascade")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("natfii/Qwen3.6-27B-VLM-Cascade")
model = AutoModelForMultimodalLM.from_pretrained("natfii/Qwen3.6-27B-VLM-Cascade")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use natfii/Qwen3.6-27B-VLM-Cascade with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "natfii/Qwen3.6-27B-VLM-Cascade"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "natfii/Qwen3.6-27B-VLM-Cascade",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/natfii/Qwen3.6-27B-VLM-Cascade

SGLang

How to use natfii/Qwen3.6-27B-VLM-Cascade with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "natfii/Qwen3.6-27B-VLM-Cascade" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "natfii/Qwen3.6-27B-VLM-Cascade",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "natfii/Qwen3.6-27B-VLM-Cascade" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "natfii/Qwen3.6-27B-VLM-Cascade",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use natfii/Qwen3.6-27B-VLM-Cascade with Docker Model Runner:
```
docker model run hf.co/natfii/Qwen3.6-27B-VLM-Cascade
```

Qwen3.6-27B-VLM-Cascade (BF16)

A <think>-style reasoning vision-language model: Qwen/Qwen3.6-27B (VLM) post-trained with a Cascade-style recipe (reasoning SFT cold-start → sequential, domain-wise RLVR + MOPD on-policy self-distillation), after the method in nvidia/Nemotron-Cascade-2-30B-A3B (arXiv 2603.19220). This is the full-precision BF16 master: the re-quantizable source of truth. It carries a 1-layer qwen3_5_mtp draft head (verbatim base head, kept BF16) for NEXTN speculative decoding.

The two-repo pattern

| Repo | Artifact | For |
|---|---|---|
| `natfii/Qwen3.6-27B-VLM-Cascade` (this one) | BF16 master + base `mtp.*` draft head | Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher |
| `natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP` | NVFP4 body + BF16 `lm_head` + BF16 MTP head | Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode) |

Lineage


| | |
|---|---|
| Base | `Qwen/Qwen3.6-27B` (VLM, image-text-to-text), apache-2.0 |
| Post-training | Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen |
| Precision | BF16 throughout (this is the master; not quantized) |
| MTP draft head | 1-layer `qwen3_5_mtp` head (verbatim base head, kept BF16) |

Architecture (from `config.json`)

- 27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (`full_attention_interval=4`), `hidden_size=5120`, `num_hidden_layers=64`. The `layer_types` list places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (independent of context length).
- Full attention: 24 query / 4 KV heads, `head_dim=256` (GQA).
- Vision tower (`model.visual.*`) in BF16; frozen during all post-training. Skip it at serve time for text-only workloads if your runtime supports it.
- MTP: 1 draft-head layer (`mtp_num_hidden_layers=1`, `mtp_use_dedicated_embeddings=False`). It fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuses `lm_head`. Here the head is the verbatim base draft head, kept BF16.
- `vocab_size=248320`.
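
The layer layout and its effect on KV-cache growth can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted above; the rule "every 4th layer is full attention" is an assumption chosen because it reproduces the stated indices, and `kv_cache_bytes_per_token` is a hypothetical helper that assumes K and V are both cached at the given element width.

```python
# Values quoted in the Architecture section above.
NUM_HIDDEN_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4
NUM_KV_HEADS = 4
HEAD_DIM = 256

# Assumption: every `interval`-th layer (counting from 1) is full attention,
# which reproduces the stated indices 3, 7, 11, ..., 63.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_HIDDEN_LAYERS)
]
full_idx = [i for i, t in enumerate(layer_types) if t == "full_attention"]


def kv_cache_bytes_per_token(bytes_per_element: int = 2) -> int:
    """Bytes one extra token adds to the KV cache across all full-attention layers.

    Assumes both K and V are stored at `bytes_per_element` (2 = BF16).
    Linear-attention layers are left out: their recurrent state has a fixed
    size and does not grow with context length.
    """
    return len(full_idx) * 2 * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


print(full_idx[:3], full_idx[-1])                              # [3, 7, 11] 63
print(len(full_idx), layer_types.count("linear_attention"))    # 16 48
print(kv_cache_bytes_per_token())                              # 65536 (64 KiB per token in BF16)
```

Only the 16 full-attention layers contribute to per-token cache growth under these assumptions, which is why the hybrid layout keeps long-context memory modest.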

The MTP head

This repo ships the verbatim base qwen3_5_mtp draft head — the original 1-layer head, kept BF16, grafted additively onto the post-trained body for NEXTN speculative decoding. Spec-decode is lossless (the draft head only affects decode speed, never the output), so the base head is a safe default; re-measure accepted length on your serving stack, and optionally re-align the head to this target if you want higher acceptance.
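
The "lossless" property can be illustrated with a toy model. The sketch below is not the runtime's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the target model and the draft head, verification is greedy for simplicity (real servers verify drafted tokens in one batched forward pass and use rejection sampling when sampling is enabled), and `k` is the number of tokens drafted per step.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Hypothetical deterministic 'target model': next token from the context."""
    return (7 * seq[-1] + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft head: agrees with the target except at every 3rd position."""
    t = toy_target(seq)
    return t if len(seq) % 3 else (t + 1) % 11


def decode_target_only(target: NextToken, prompt: List[int], n: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def decode_speculative(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int = 1
) -> Tuple[List[int], List[int]]:
    seq, accepted_per_step = list(prompt), []
    while len(seq) - len(prompt) < n:
        # 1) Draft k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2) Verify: keep drafted tokens while they match the target's choice.
        accepted = 0
        for tok in proposal:
            expected = target(seq)
            if tok != expected:
                seq.append(expected)  # the target's token replaces the rejected draft
                break
            seq.append(tok)
            accepted += 1
        else:
            seq.append(target(seq))  # all drafts accepted: target adds a bonus token
        accepted_per_step.append(accepted)
    return seq[len(prompt):][:n], accepted_per_step


reference = decode_target_only(toy_target, [1, 2], 20)
spec, accepted = decode_speculative(toy_target, toy_draft, [1, 2], 20, k=1)
print(spec == reference)                      # True: the draft never changes the output
print(sum(accepted) / len(accepted))          # mean accepted length per step
```

Every emitted token is either confirmed or supplied by the target, so a weaker draft only lowers the mean accepted length, which is the quantity worth re-measuring on a real serving stack.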

Fusion: the head uses single-final-hidden NEXTN (--fusion final), not EAGLE-3 multi-layer fusion.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — when a request does not set enable_thinking, the template emits an empty <think></think> and the model answers directly.

- Instruct (default): the template emits an adjacent empty `<think></think>`, so there is no visible reasoning trace.
- Thinking (opt-in): pass `chat_template_kwargs={"enable_thinking": true}` (or put `<|think_on|>` in the system message); generation then begins with `<think>\n` and the model reasons before answering. `<|think_off|>` or `enable_thinking=false` forces it off.
- Termination handoff (thinking mode only): the template appends a brief reasoning-to-answer instruction to the system prompt (reason fully, verify, then close `</think>` and answer; don't re-confirm settled work). This curbs runaway re-verification loops. It is not applied in instruct mode or when tools are passed.
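
As a rough illustration of the opt-in toggle, the sketch below builds an OpenAI-compatible chat request body and adds `chat_template_kwargs` only when thinking is requested. `build_chat_request` and its parameters are hypothetical names introduced here; it assumes the server forwards `chat_template_kwargs` from the request body to the chat template, as the vLLM and SGLang OpenAI-compatible endpoints are expected to do.

```python
import json
from typing import Optional

MODEL_ID = "natfii/Qwen3.6-27B-VLM-Cascade"


def build_chat_request(
    prompt: str,
    image_url: Optional[str] = None,
    thinking: bool = False,
    max_tokens: int = 512,
) -> dict:
    """Build a /v1/chat/completions body (hypothetical helper, not a library API)."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    if thinking:
        # Omitting this key leaves the template default in place (thinking off).
        payload["chat_template_kwargs"] = {"enable_thinking": True}
    return payload


# Instruct mode (default) versus thinking mode with a larger completion budget.
print(json.dumps(build_chat_request("What is 17 * 23?"), indent=2))
print(json.dumps(build_chat_request("What is 17 * 23?", thinking=True, max_tokens=4096), indent=2))
```

The resulting dictionary can be sent as the JSON body of the same `/v1/chat/completions` request used in the curl examples.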

This model reasons at length, so enabling thinking under a small max_tokens can return a truncated reply that contains only reasoning; budget the completion accordingly. When serving via vLLM or SGLang you can hard-cap the thinking: vLLM's thinking_token_budget=N (needs --reasoning-parser qwen3) or SGLang's --enable-strict-thinking with custom_params={"thinking_budget": N} force-closes </think> after N reasoning tokens. Set the cap generously (~3000–4000; genuinely hard problems use ~2800) so that it only catches runaway loops.
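
When no reasoning parser is configured, a client can detect the only-reasoning failure mode itself. The following is a minimal sketch with hypothetical names (`ParsedReply`, `split_reply`); it assumes that in thinking mode the opening `<think>` may already be part of the prompt, so only the closing `</think>` is reliably present in the completion text.

```python
from typing import NamedTuple

OPEN, CLOSE = "<think>", "</think>"


class ParsedReply(NamedTuple):
    reasoning: str
    answer: str
    truncated: bool  # True when reasoning never closed, so no answer was produced


def split_reply(completion: str, thinking: bool) -> ParsedReply:
    """Split raw completion text into (reasoning, answer) and flag truncated reasoning."""
    text = completion.lstrip()
    opened = text.startswith(OPEN)
    if opened:
        text = text[len(OPEN):]
    if CLOSE in text:
        reasoning, answer = text.split(CLOSE, 1)
        return ParsedReply(reasoning.strip(), answer.strip(), False)
    if thinking or opened:
        # Reasoning started but </think> never arrived: the token budget ran out.
        return ParsedReply(text.strip(), "", True)
    return ParsedReply("", text.strip(), False)


print(split_reply("17 * 23 = 391, checked.\n</think>\n\n391", thinking=True))
print(split_reply("Let me verify the product again", thinking=True))
print(split_reply("391", thinking=False))
```

A `truncated=True` result suggests raising `max_tokens` or setting a thinking cap. An answer cut off after `</think>` is not caught here; the `finish_reason` of `"length"` in the API response covers that case.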

Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1. Never decode greedily: temperature=0 loops, and temperature=1.0 rambles (the paper's 1.0 is for avg@k evaluation only). The repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking mode, letting it close </think> and answer (clean termination, no measured accuracy loss); lowering the temperature does not help, because it deepens the loop. To split the <think> trace into a separate reasoning channel, use your runtime's qwen3 reasoning parser (the separated trace is message.reasoning on vLLM 0.22.0 and reasoning_content on SGLang).
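
One way to keep these defaults in a client is a small guard that refuses greedy decoding. This is a sketch only: `sampling_params` is a hypothetical helper, and it assumes the server accepts `top_k` and `repetition_penalty` as extra fields in the OpenAI-compatible request body (they are not part of the standard OpenAI schema).

```python
RECOMMENDED_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.1,
}


def sampling_params(**overrides) -> dict:
    """Return the recommended sampling settings with overrides applied.

    Raises ValueError for temperature <= 0, since greedy decoding is
    reported above to loop on this model.
    """
    params = {**RECOMMENDED_SAMPLING, **overrides}
    if params["temperature"] <= 0:
        raise ValueError("greedy decoding (temperature=0) loops on this model; use temperature > 0")
    return params


# Merge into an OpenAI-compatible request body before sending it.
payload = {
    "model": "natfii/Qwen3.6-27B-VLM-Cascade",
    "messages": [{"role": "user", "content": "Summarize GQA in one sentence."}],
    "max_tokens": 512,
}
payload.update(sampling_params())
print(payload)
```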

Usage (BF16, transformers)

# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText
# with trust_remote_code is the portable fallback.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "natfii/Qwen3.6-27B-VLM-Cascade"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Thinking is OFF by default (empty "<think></think>"); pass
# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.

Spec-decode / NEXTN: the BF16 mtp.* head (the verbatim base head) ships in this repo, so runtimes that support the qwen3_5_mtp / NEXTN draft method can speculate directly against it. (For a turnkey, memory-bandwidth-friendly GB10 deployment, prefer the NVFP4-MTP repo.)

Re-quantizing this master (e.g. to NVFP4 for GB10)

This BF16 master is the source from which the NVFP4-MTP deployment build is made. To reproduce that build, re-quantize with nvidia-modelopt and keep the BF16-head invariant ignore-list byte-for-byte (pipeline S4): exclude *model.visual*, *linear_attn.conv1d*, *lm_head*, and *mtp* from NVFP4, and keep the KV-cache FP8 setting identical. Note that linear_attn.in_proj_* and out_proj are NVFP4-quantized; re-verify in_proj against hf_quant_config.json at the S4 build. Keeping the output and draft heads out of FP4 is what protects both answer quality and speculative acceptance. Graft the mtp.* head into the quantized export (kept BF16, outside the FP4 body); the base head transfers, but re-measure accepted length and optionally re-align the head to the quantized target for higher acceptance.
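
Before running a re-quantization, the ignore-list can be sanity-checked against module names. The sketch below is not the nvidia-modelopt API: it assumes the patterns behave like shell-style wildcards (matched here with Python's `fnmatch`), and the module names are hypothetical examples, not names read from this checkpoint.

```python
from fnmatch import fnmatchcase

# Ignore-list quoted in the paragraph above.
NVFP4_IGNORE = ["*model.visual*", "*linear_attn.conv1d*", "*lm_head*", "*mtp*"]


def excluded_from_nvfp4(module_name: str) -> bool:
    """True if the module matches an ignore pattern and is left unquantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in NVFP4_IGNORE)


# Hypothetical module names, for illustration only.
examples = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.out_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "lm_head",
    "mtp.fc",
]
for name in examples:
    print(f"{name}: {'excluded' if excluded_from_nvfp4(name) else 'NVFP4'}")
```

Under these assumptions the `in_proj_*` and `out_proj` projections fall through to NVFP4 while the vision tower, `conv1d`, `lm_head`, and `mtp.*` stay out of the FP4 body, matching the note above. Comparing such a listing with the exported `hf_quant_config.json` is one way to do the re-verification.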

License, attribution & data provenance

License — Apache-2.0. This model is a derivative of Qwen/Qwen3.6-27B (released under Apache-2.0) and is itself published under Apache-2.0. You may use it commercially or non-commercially, provided you retain the LICENSE and NOTICE files and the attributions below.

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team — Apache-2.0.
Cascade-style post-training, MTP-head graft + re-align, and packaging by natfii.
Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.

| Stage | Dataset(s) | License |
|---|---|---|
| SFT cold-start (~10k `<think>` traces; ~6k math + ~4k code) | `open-thoughts/OpenThoughts-114k` + `open-r1/OpenR1-Math-220k` | Apache-2.0 (both) |
| Math RLVR prompts | `nvidia/AceReason-Math` (← NuminaMath-1.5 + DeepScaleR-Preview) | CC-BY-4.0 |
| IF-RL / MOPD / multi-domain prompts + verifiers | `nvidia/Nemotron-Cascade-2-RL-data` | ODC-BY-1.0 |
| MOPD + MTP-head self-distillation | the model's own frozen checkpoint (no third-party teacher) | n/a |

The SFT traces are DeepSeek-R1-distilled (via the two open datasets above); DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets relicense their traces under Apache-2.0 — disclosed for transparency; no extra obligation attaches. Full attributions are reproduced in the repo NOTICE file.

Intended use & limitations

- Intended use: local/homelab reasoning, vision-language, and agentic/tool use; a re-quantizable BF16 master for building deployment variants.
- Not production-evaluated beyond light benchmarking (see Evaluation below); validate for your use case.
- Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in the LM weights); evaluate vision before relying on it.
- MTP acceptance is empirical: the draft head is the verbatim base head, so re-measure accepted length on your serving stack. The fusion mode is settled: single-final-hidden NEXTN (--fusion final).
- Inherits all base-model limitations (hallucination, bias, knowledge cutoff).

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Provenance

Cascade-style post-training, MTP-head graft, and packaging by natfii via the qwen-cascade pipeline (single GB10 / DGX Spark, SM121). The NVFP4-MTP deployment repo is re-quantized from this master with the BF16-head invariant.
