Qwen3.6-27B-VLM-Cascade (BF16)
A <think>-style reasoning vision-language model: Qwen/Qwen3.6-27B (VLM)
post-trained with a Cascade-style recipe (reasoning SFT cold-start →
sequential, domain-wise RLVR + MOPD on-policy self-distillation), after the
method in nvidia/Nemotron-Cascade-2-30B-A3B
(arXiv 2603.19220). This is the full-precision BF16 master: the
re-quantizable source of truth. It carries a 1-layer qwen3_5_mtp draft head
(verbatim base head, kept BF16) for NEXTN speculative decoding.
The two-repo pattern
| Repo | Artifact | For |
|---|---|---|
| natfii/Qwen3.6-27B-VLM-Cascade (this one) | BF16 master + base mtp.* draft head | Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher |
| natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP | NVFP4 body + BF16 lm_head + BF16 MTP head | Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode) |
Lineage
| Base | Qwen/Qwen3.6-27B (VLM, image-text-to-text), apache-2.0 |
| Post-training | Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen |
| Precision | BF16 throughout (this is the master; not quantized) |
| MTP draft head | 1-layer qwen3_5_mtp head (verbatim base head, kept BF16) |
Architecture (from config.json)
- 27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64. The layer_types list places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (context-length independent).
- Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
- Vision tower (model.visual.*) in BF16; frozen during all post-training. Skip at serve time for text-only workloads if your runtime supports it.
- MTP: 1 draft-head layer (mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False). It fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuses lm_head. Here the head is the verbatim base draft head, kept BF16.
- vocab_size=248320.
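The layer pattern and the per-token KV-cache cost implied by the numbers above can be checked with a few lines of Python. This is a minimal sketch: the layer-type strings and helper names are illustrative assumptions rather than values read from config.json, and it assumes only the full-attention layers keep a growing KV cache (the linear-attention blocks hold a constant-size state, as noted above).

```python
# Values quoted from the architecture list above.
NUM_HIDDEN_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4
NUM_KV_HEADS = 4
HEAD_DIM = 256


def build_layer_types(num_layers: int, interval: int) -> list[str]:
    """Every `interval`-th layer (counting from 1) is full attention."""
    # The two label strings are assumed names, used here for illustration only.
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def kv_cache_bytes_per_token(num_full_layers: int, bytes_per_value: int) -> int:
    """One K and one V vector per KV head, per full-attention layer."""
    return num_full_layers * 2 * NUM_KV_HEADS * HEAD_DIM * bytes_per_value


layer_types = build_layer_types(NUM_HIDDEN_LAYERS, FULL_ATTENTION_INTERVAL)
full_idx = [i for i, t in enumerate(layer_types) if t == "full_attention"]

print(len(full_idx), len(layer_types) - len(full_idx))  # 16 48
print(full_idx[:3], full_idx[-1])                       # [3, 7, 11] 63
print(kv_cache_bytes_per_token(len(full_idx), 2))       # 65536 bytes with a BF16 cache
print(kv_cache_bytes_per_token(len(full_idx), 1))       # 32768 bytes with an FP8 cache
```

Under these assumptions the growing part of the cache costs about 64 KiB per token in BF16, or 32 KiB with an FP8 KV cache.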
The MTP head
This repo ships the verbatim base qwen3_5_mtp draft head — the original
1-layer head, kept BF16, grafted additively onto the post-trained body for NEXTN
speculative decoding. Spec-decode is lossless (the draft head only affects
decode speed, never the output), so the base head is a safe default; re-measure
accepted length on your serving stack, and optionally re-align the head to this
target if you want higher acceptance.
Fusion: the head uses single-final-hidden NEXTN (--fusion final), not EAGLE-3 multi-layer fusion.
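Why a draft head can change speed but not output is easiest to see in a toy draft-then-verify loop. The sketch below is not this model's runtime: `target_next`, `good_draft` and `bad_draft` are hypothetical stand-ins, one token is drafted per step (matching a 1-layer head), and verification is greedy for simplicity. Real runtimes that sample use a more involved acceptance rule.

```python
from typing import Callable, List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the target model's next-token choice."""
    return (sum(ctx) * 31 + len(ctx)) % 97


def good_draft(ctx: List[int]) -> int:
    """Toy draft that agrees with the target on most positions."""
    guess = target_next(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 97


def bad_draft(ctx: List[int]) -> int:
    """Toy draft that never agrees with the target."""
    return (target_next(ctx) + 1) % 97


def plain_decode(prefix: List[int], n: int) -> List[int]:
    """Reference decode: the target alone, one token per step."""
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out


def speculative_decode(
    prefix: List[int], n: int, draft: Callable[[List[int]], int]
) -> Tuple[List[int], float]:
    """Draft one token per step, keep it only if the target agrees."""
    out = list(prefix)
    accepted = proposed = 0
    while len(out) - len(prefix) < n:
        guess = draft(out)
        proposed += 1
        verified = target_next(out)       # the target's own choice here
        out.append(verified)              # only target tokens are ever emitted
        if guess == verified:
            accepted += 1
            out.append(target_next(out))  # bonus token when the draft was right
    return out[: len(prefix) + n], accepted / proposed


prefix = [5, 17]
reference = plain_decode(prefix, 40)
tokens_good, rate_good = speculative_decode(prefix, 40, good_draft)
tokens_bad, rate_bad = speculative_decode(prefix, 40, bad_draft)

print(tokens_good == reference, tokens_bad == reference)  # True True
print(rate_good > rate_bad)                               # True
```

Both drafts reproduce the reference sequence exactly; only the acceptance rate, and therefore the number of target steps, differs. That is the quantity to re-measure on your own serving stack.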
Reasoning modes
ChatML with toggleable thinking, à la Cascade. Thinking is off by default — when
a request does not set enable_thinking, the template emits an empty <think></think>
and the model answers directly.
- Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
- Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or put <|think_on|> in the system message); generation then begins <think>\n and the model reasons before answering. <|think_off|> / enable_thinking=false forces it off.
- Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work). This curbs runaway re-verification loops; it is not applied in instruct mode or when tools are passed.
This model reasons at length, so enabling thinking under a small max_tokens can
return a truncated, reasoning-only reply; budget the completion accordingly. When serving via
vLLM or SGLang you can hard-cap the thinking: vLLM thinking_token_budget=N (needs
--reasoning-parser qwen3) or SGLang --enable-strict-thinking + custom_params={"thinking_budget": N}.
Either one force-closes </think> after N reasoning tokens. Set N generously (~3000–4000; genuinely hard
problems use ~2800) so it only catches runaway loops.
Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1. Never decode
greedily: temperature=0 loops, and at 1.0 the model rambles (the paper's 1.0 is for avg@k eval only). The
repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking
mode, letting it close </think> and answer (clean termination, no measured accuracy
loss). Lowering temperature does not help; it deepens the loop.
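The sampling settings and the thinking toggle above can be combined into one request body for an OpenAI-compatible chat endpoint. This is a sketch under assumptions: `build_chat_request` and `answer_reserve` are hypothetical names, the placement of `top_k`, `repetition_penalty` and `chat_template_kwargs` as top-level fields should be checked against your runtime, and the runtime-specific hard cap on thinking tokens is deliberately left out.

```python
import json

MODEL_ID = "natfii/Qwen3.6-27B-VLM-Cascade"


def build_chat_request(
    prompt: str,
    thinking: bool = False,
    thinking_budget: int = 4000,  # upper end of the ~3000-4000 range suggested above
    answer_reserve: int = 1024,   # hypothetical allowance for the final answer
) -> dict:
    """Build a chat-completions payload with the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Recommended sampling; never temperature=0.
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 20,
        "repetition_penalty": 1.1,
        # Thinking is off unless explicitly enabled.
        "chat_template_kwargs": {"enable_thinking": thinking},
        # Leave room for the reasoning trace so the answer is not cut off.
        "max_tokens": answer_reserve + (thinking_budget if thinking else 0),
    }


payload = build_chat_request("Prove that 17 is prime.", thinking=True)
print(json.dumps(payload, indent=2))
```

Sizing `max_tokens` as reasoning budget plus an answer reserve is one way to avoid the truncated, reasoning-only replies described above.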
To split the <think> trace into a separate reasoning channel, use your runtime's qwen3
reasoning parser (the separated trace is message.reasoning on vLLM 0.22.0, reasoning_content
on SGLang).
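If no reasoning parser is available, the trace can be separated client-side by splitting on the closing tag. The helper below is a minimal sketch, not part of any runtime: `split_reasoning` is a hypothetical name, and it assumes the raw completion text is available and that the caller knows whether thinking was enabled for the request.

```python
OPEN, CLOSE = "<think>", "</think>"


def split_reasoning(completion: str, thinking: bool) -> tuple[str, str]:
    """Return (reasoning, answer) from raw completion text."""
    text = completion
    # The opening tag may or may not be echoed in the completion itself.
    if text.startswith(OPEN):
        text = text[len(OPEN):]
    if CLOSE in text:
        reasoning, answer = text.split(CLOSE, 1)
        return reasoning.strip(), answer.strip()
    if thinking:
        # No closing tag in thinking mode: the reply was cut off mid-reasoning.
        return text.strip(), ""
    # Instruct mode: the whole completion is the answer.
    return "", text.strip()


print(split_reasoning("2 + 2 is 4.\n</think>\n\n4", thinking=True))
print(split_reasoning("Let me check again", thinking=True))
print(split_reasoning("4", thinking=False))
```

An empty answer with non-empty reasoning is the truncated, reasoning-only case; raising `max_tokens` or capping the thinking budget is the usual remedy.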
Usage (BF16, transformers)
# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText
# with trust_remote_code is the portable fallback.
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "natfii/Qwen3.6-27B-VLM-Cascade"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Thinking is OFF by default (empty "<think></think>"); pass
# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.
Spec-decode / NEXTN: the BF16 mtp.* head (the verbatim base head) is present
alongside this BF16 target, so runtimes that support the qwen3_5_mtp / NEXTN draft
method can speculate directly against this repo. (For a turnkey, memory-bandwidth-friendly
GB10 deployment, prefer the NVFP4-MTP repo.)
Re-quantizing this master (e.g. to NVFP4 for GB10)
This BF16 master is the source the NVFP4-MTP deployment build is made from. To
reproduce that build, re-quant with nvidia-modelopt and keep the
BF16-head invariant ignore-list byte-for-byte (pipeline S4): exclude
*model.visual*, *linear_attn.conv1d*, *lm_head*, and *mtp*
from NVFP4 (note: linear_attn.in_proj_* and out_proj ARE NVFP4-quantized —
re-verify in_proj against hf_quant_config.json at S4 build), and keep the
KV-cache FP8 setting identical. Keeping the output and
draft heads out of FP4 is what protects both answer quality and speculative
acceptance. Graft the mtp.* head into the quantized export (kept BF16, out of the
FP4 body); the base head transfers, but re-measure accepted length and optionally
re-align it to the quantized target for higher acceptance.
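Before a long quantization run it can help to check which modules the ignore-list actually excludes. The sketch below applies the four patterns above with Python's `fnmatch`. It assumes the patterns follow glob semantics, and the module names are hypothetical examples, not names read from this checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns quoted from the ignore-list above: these stay out of NVFP4.
IGNORE_PATTERNS = ["*model.visual*", "*linear_attn.conv1d*", "*lm_head*", "*mtp*"]


def stays_bf16(module_name: str) -> bool:
    """True if any ignore pattern matches, i.e. the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
    "model.language_model.layers.0.linear_attn.out_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "lm_head",
    "mtp.fc",
]

for name in EXAMPLE_MODULES:
    print(f"{'BF16 ' if stays_bf16(name) else 'NVFP4'}  {name}")
```

With these example names, the linear_attn in_proj and out_proj modules fall through to NVFP4 while conv1d, lm_head, the vision tower and the mtp head are kept out, which matches the split described above.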
License, attribution & data provenance
License — Apache-2.0. This model is a derivative of
Qwen/Qwen3.6-27B (released under
Apache-2.0) and is itself published under Apache-2.0. You may use it
commercially or non-commercially, provided you retain the LICENSE and NOTICE
files and the attributions below.
Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.
Attribution.
- Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team (Apache-2.0).
- Cascade-style post-training, MTP-head graft + re-align, and packaging by natfii.
- Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220). This is method emulation only, not a redistribution of NVIDIA's pipeline or weights.
Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.
| Stage | Dataset(s) | License |
|---|---|---|
| SFT cold-start (~10k <think> traces; ~6k math + ~4k code) | open-thoughts/OpenThoughts-114k + open-r1/OpenR1-Math-220k | Apache-2.0 (both) |
| Math RLVR prompts | nvidia/AceReason-Math (← NuminaMath-1.5 + DeepScaleR-Preview) | CC-BY-4.0 |
| IF-RL / MOPD / multi-domain prompts + verifiers | nvidia/Nemotron-Cascade-2-RL-data | ODC-BY-1.0 |
| MOPD + MTP-head self-distillation | the model's own frozen checkpoint (no third-party teacher) | — |
The SFT traces are DeepSeek-R1-distilled (via the two open datasets above);
DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets
relicense their traces under Apache-2.0 — disclosed for transparency; no extra
obligation attaches. Full attributions are reproduced in the repo NOTICE file.
Intended use & limitations
- Intended use: local/homelab reasoning + vision-language + agentic/tool use; a re-quantizable BF16 master for building deployment variants.
- Not production-evaluated, and no benchmark results are published for this release (see Evaluation); validate for your use case.
- Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in LM weights); evaluate vision before relying on it.
- MTP acceptance is empirical: the draft head is the verbatim base head, so accepted length should be re-measured on your serving stack (the fusion index is resolved: single-final-hidden NEXTN, --fusion final).
- Inherits all base-model limitations (hallucination, bias, knowledge cutoff).
Evaluation
Benchmarking was time-gated for this release, so no benchmark results are reported here. Run full benchmarks on your own serving stack before relying on the model.
Provenance
Cascade-style post-training, MTP-head graft, and packaging by natfii
via the qwen-cascade pipeline (single GB10 / DGX Spark,
SM121). The NVFP4-MTP deployment repo is re-quantized from this master with the
BF16-head invariant.