Qwen3.6-27B-VLM-Cascade (BF16)

A <think>-style reasoning vision-language model: Qwen/Qwen3.6-27B (VLM) post-trained with a Cascade-style recipe (reasoning SFT cold-start → sequential, domain-wise RLVR + MOPD on-policy self-distillation), after the method in nvidia/Nemotron-Cascade-2-30B-A3B (arXiv 2603.19220). This is the full-precision BF16 master: the re-quantizable source of truth. It carries a 1-layer qwen3_5_mtp draft head (verbatim base head, kept BF16) for NEXTN speculative decoding.

The two-repo pattern

Repo Artifact For
natfii/Qwen3.6-27B-VLM-Cascade (this one) BF16 master + base mtp.* draft head Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP NVFP4 body + BF16 lm_head + BF16 MTP head Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

Lineage

Base Qwen/Qwen3.6-27B (VLM, image-text-to-text), apache-2.0
Post-training Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen
Precision BF16 throughout (this is the master; not quantized)
MTP draft head 1-layer qwen3_5_mtp head (verbatim base head, kept BF16)

Architecture (from config.json)

  • 27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64. The layer_types list places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (context-length independent).
  • Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
  • Vision tower (model.visual.*) in BF16; frozen during all post-training. Skip at serve time for text-only workloads if your runtime supports it.
  • MTP: 1 draft-head layer (mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False) — fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuses lm_head. Here the head is the verbatim base draft head, kept BF16.
  • vocab_size=248320.

The MTP head

This repo ships the verbatim base qwen3_5_mtp draft head — the original 1-layer head, kept BF16, grafted additively onto the post-trained body for NEXTN speculative decoding. Spec-decode is lossless (the draft head only affects decode speed, never the output), so the base head is a safe default; re-measure accepted length on your serving stack, and optionally re-align the head to this target if you want higher acceptance.

Fusion: the head uses single-final-hidden NEXTN (--fusion final), not EAGLE-3 multi-layer fusion.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — when a request does not set enable_thinking, the template emits an empty <think></think> and the model answers directly.

  • Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
  • Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or put <|think_on|> in the system message); generation then begins <think>\n and the model reasons before answering. <|think_off|> / enable_thinking=false forces it off.
  • Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs runaway re-verification loops; not applied in instruct mode or when tools are passed.

This model reasons at length, so enabling thinking under a small max_tokens can return an only-reasoning, truncated reply — budget the completion accordingly. When serving via vLLM or SGLang you can hard-cap the thinking: vLLM thinking_token_budget=N (needs --reasoning-parser qwen3), or SGLang --enable-strict-thinking + custom_params={"thinking_budget": N}, force-close </think> after N reasoning tokens — set it generously (~3000–4000; genuine hard problems use ~2800) so it only catches runaway loops.

Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy (temperature=0 loops; at 1.0 it rambles — the paper's 1.0 is for avg@k eval only). The repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking mode — it lets the model close </think> and answer (clean termination, no measured accuracy loss); lowering temperature does not help (it deepens the loop). To split the <think> trace into a separate reasoning channel, use your runtime's qwen3 reasoning parser (the separated trace is message.reasoning on vLLM 0.22.0, reasoning_content on SGLang).

Usage (BF16, transformers)

# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText
# with trust_remote_code is the portable fallback.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "natfii/Qwen3.6-27B-VLM-Cascade"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Thinking is OFF by default (empty "<think></think>"); pass
# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.

Spec-decode / NEXTN: the BF16 mtp.* head is present and aligned to this BF16 target, so runtimes that support the qwen3_5_mtp / NEXTN draft method can speculate directly against this repo. (For a turnkey, memory-bandwidth-friendly GB10 deployment, prefer the NVFP4-MTP repo.)

Re-quantizing this master (e.g. to NVFP4 for GB10)

This BF16 master is the source the NVFP4-MTP deployment build is made from. To reproduce that build, re-quant with nvidia-modelopt and keep the BF16-head invariant ignore-list byte-for-byte (pipeline S4): exclude *model.visual*, *linear_attn.conv1d*, *lm_head*, and *mtp* from NVFP4 (note: linear_attn.in_proj_* and out_proj ARE NVFP4-quantized — re-verify in_proj against hf_quant_config.json at S4 build), and keep the KV-cache FP8 setting identical. Keeping the output and draft heads out of FP4 is what protects both answer quality and speculative acceptance. Graft the mtp.* head into the quantized export (kept BF16, out of the FP4 body); the base head transfers, but re-measure accepted length and optionally re-align it to the quantized target for higher acceptance.

License, attribution & data provenance

License — Apache-2.0. This model is a derivative of Qwen/Qwen3.6-27B (released under Apache-2.0) and is itself published under Apache-2.0. You may use it commercially or non-commercially, provided you retain the LICENSE and NOTICE files and the attributions below.

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

  • Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team — Apache-2.0.
  • Cascade-style post-training, MTP-head graft + re-align, and packaging by natfii.
  • Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.

Stage Dataset(s) License
SFT cold-start (~10k <think> traces; ~6k math + ~4k code) open-thoughts/OpenThoughts-114k + open-r1/OpenR1-Math-220k Apache-2.0 (both)
Math RLVR prompts nvidia/AceReason-Math (← NuminaMath-1.5 + DeepScaleR-Preview) CC-BY-4.0
IF-RL / MOPD / multi-domain prompts + verifiers nvidia/Nemotron-Cascade-2-RL-data ODC-BY-1.0
MOPD + MTP-head self-distillation the model's own frozen checkpoint (no third-party teacher)

The SFT traces are DeepSeek-R1-distilled (via the two open datasets above); DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets relicense their traces under Apache-2.0 — disclosed for transparency; no extra obligation attaches. Full attributions are reproduced in the repo NOTICE file.

Intended use & limitations

  • Intended use: local/homelab reasoning + vision-language + agentic/tool use; a re-quantizable BF16 master for building deployment variants.
  • Not production-evaluated beyond the light benchmark above — validate for your use case.
  • Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in LM weights); evaluate vision before relying on it.
  • MTP acceptance is empirical: the draft head is the verbatim base head, so accepted-length should be re-measured on your serving stack (fusion-index is RESOLVED: single-final-hidden NEXTN, --fusion final).
  • Inherits all base-model limitations (hallucination, bias, knowledge cutoff).

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Provenance

Cascade-style post-training, MTP-head graft, and packaging by natfii via the qwen-cascade pipeline (single GB10 / DGX Spark, SM121). The NVFP4-MTP deployment repo is re-quantized from this master with the BF16-head invariant.

Downloads last month
77
Safetensors
Model size
28B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for natfii/Qwen3.6-27B-VLM-Cascade

Base model

Qwen/Qwen3.6-27B
Finetuned
(236)
this model