TurboQwen3.6

Canonical artifact: Qwen3.6-27B-MTP-TQ3_4S

TurboQwen3.6 is the public release name for the TurboQuant GGUF build of the Qwen3.6 27B MTP model line.

The file name and runtime artifact name remain unchanged:

  • Qwen3.6-27B-MTP-TQ3_4S.gguf

Parent Model

  • Upstream parent: unsloth/Qwen3.6-27B-MTP-GGUF
  • Format conversion and TurboQuant packaging: turbo-tan/llama.cpp-tq3

This release is intended for the public TurboQuant runtime fork:

  • https://github.com/turbo-tan/llama.cpp-tq3

It requires TQ3_4S runtime support and draft-MTP support. It is not expected to run correctly on stock llama.cpp builds that do not contain these extensions.

Matching Projector

The multimodal projector is published separately so the main Hugging Face page stays anchored on the 27B text model:

  • https://huggingface.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S-mmproj

Files

  • Qwen3.6-27B-MTP-TQ3_4S.gguf - main model, 13.39 GiB
  • mmproj.gguf - matching multimodal projector, 0.87 GiB, hosted in the separate projector repo above
  • thumbnail.png - model card image
  • benchmark.png - benchmark summary image
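
One way to fetch the two GGUF files and sanity-check their sizes is sketched below. It assumes the Hugging Face CLI (huggingface-cli) is installed and that the main repo ID is YTan2000/Qwen3.6-27B-MTP-TQ3_4S (the projector repo name without the -mmproj suffix). The run, fetch_release, and bytes_to_gib helpers are hypothetical names for illustration, not part of the release. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumed repo IDs: the projector repo is the one linked above; the main
# repo ID is assumed to be the same name without the -mmproj suffix.
MAIN_REPO="YTan2000/Qwen3.6-27B-MTP-TQ3_4S"
PROJ_REPO="YTan2000/Qwen3.6-27B-MTP-TQ3_4S-mmproj"

# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download the main model and the matching projector into the current directory.
fetch_release() {
  run huggingface-cli download "$MAIN_REPO" Qwen3.6-27B-MTP-TQ3_4S.gguf --local-dir .
  run huggingface-cli download "$PROJ_REPO" mmproj.gguf --local-dir .
}

# Convert a byte count to GiB (1 GiB = 1024^3 bytes), rounded to two decimals,
# so it can be compared with the sizes listed above.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

fetch_release
```

After a real download, something like bytes_to_gib "$(wc -c < Qwen3.6-27B-MTP-TQ3_4S.gguf)" should print a value close to 13.39 if the file arrived intact.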

Recommended Runtime

Use flash attention at runtime and enable draft-MTP speculative decoding:

./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --mmproj mmproj.gguf \
  --alias Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 32768 -np 1 -ngl 99 -fa on \
  -ctk q8_0 -ctv tq3_0 \
  --spec-type draft-mtp \
  --spec-draft-n-min 1 \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.0 \
  --reasoning off --jinja

Important build note:

  • -fa on above is the runtime flash-attention flag.
  • Do not confuse it with the CMake build flag GGML_CUDA_FA_ALL_QUANTS.
  • The validated fast release path uses runtime -fa on with GGML_CUDA_FA_ALL_QUANTS=OFF.
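
The build note above could translate into a configure step along these lines. This is a sketch under assumptions: it presumes the fork keeps upstream llama.cpp's CMake layout and the GGML_CUDA option, which should be checked against the fork's own README. The run and build_runtime helpers are hypothetical names, and the job count of 8 is arbitrary. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumption: the fork builds like upstream llama.cpp (CMake, GGML_CUDA option).
REPO_URL="https://github.com/turbo-tan/llama.cpp-tq3"

# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone the fork, configure with GGML_CUDA_FA_ALL_QUANTS=OFF, and build.
build_runtime() {
  run git clone "$REPO_URL" llama.cpp-tq3
  run cmake -S llama.cpp-tq3 -B llama.cpp-tq3/build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=OFF
  run cmake --build llama.cpp-tq3/build --config Release -j 8
}

build_runtime
```

The launch commands in this card use ./build/bin/llama-server, so they are meant to be run from inside the cloned llama.cpp-tq3 directory.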

Quick Smoke Test

For a smaller local smoke test, reduce the context to 4096 tokens and skip warmup. This example listens on port 8096:

./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --mmproj mmproj.gguf \
  --alias Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8096 \
  -c 4096 -np 1 -ngl 99 -fa on \
  -ctk q8_0 -ctv tq3_0 \
  --spec-type draft-mtp \
  --spec-draft-n-min 1 \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.0 \
  --reasoning off --jinja --no-warmup

Then:

curl -s http://127.0.0.1:8096/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen3.6-27B-MTP-TQ3_4S.gguf","messages":[{"role":"user","content":"Write ONLY the word ok."}],"max_tokens":32,"temperature":0}'

Expected assistant content:

ok
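
To turn the manual check above into a pass/fail step, the reply can be extracted from the JSON body and compared with the expected word. The sketch below assumes the standard chat-completions response shape, where the reply sits under choices[0].message.content. It uses a deliberately simple sed pattern that breaks on escaped quotes, so a real JSON tool such as jq is the sturdier choice where available. extract_content and is_ok_reply are hypothetical helper names.

```shell
#!/bin/sh
# Extract the first "content" string from a chat-completions JSON body on stdin.
# Simplistic on purpose: assumes the content holds no escaped quotes.
extract_content() {
  sed -n 's/.*"content"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed (exit 0) only if the extracted reply is exactly "ok".
is_ok_reply() {
  [ "$(extract_content)" = "ok" ]
}

# Usage against the smoke-test server (not executed here):
#   curl -s http://127.0.0.1:8096/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"Qwen3.6-27B-MTP-TQ3_4S.gguf","messages":[{"role":"user","content":"Write ONLY the word ok."}],"max_tokens":32,"temperature":0}' \
#     | is_ok_reply && echo PASS
```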

Benchmark Summary

Local BenchLoop comparison on an RTX 3090, using draft-MTP and the runtime settings above:

Metric             Result
Overall score      86.28
EasyCode           100.00%
Hard86             88.4%
Toolcall           96.67%
Data extract       90.97%
Instruct follow    76.67%
Reason math        73.33%
Generation speed   44.80 tok/s
Size               13.39 GiB

The packaged benchmark summary image is included in this repo as benchmark.png.

Notes

  • This is an MTP release. Use --spec-type draft-mtp with --spec-draft-n-max 2.
  • Use --spec-draft-p-min 0.0 on the current TurboQuant runtime.
  • Use -ctk q8_0 -ctv tq3_0 for the validated release profile.
  • If draft acceptance collapses to 0.00000 on long prompts, stop and check the runtime build and launch flags before benchmarking.
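
Because the full profile and the smoke test differ only in context size, port, and warmup, the validated flags from the notes above can be kept in one place. The following is a minimal sketch: server_cmd is a hypothetical helper that prints the launch command for a given context size and port, using only the flags already shown in this card.

```shell
#!/bin/sh
# Print the validated launch command on one line.
# $1 = context size (default 32768), $2 = port (default 8080).
server_cmd() {
  ctx="${1:-32768}"
  port="${2:-8080}"
  printf '%s ' ./build/bin/llama-server \
    -m Qwen3.6-27B-MTP-TQ3_4S.gguf \
    --mmproj mmproj.gguf \
    --alias Qwen3.6-27B-MTP-TQ3_4S.gguf \
    --host 127.0.0.1 --port "$port" \
    -c "$ctx" -np 1 -ngl 99 -fa on \
    -ctk q8_0 -ctv tq3_0 \
    --spec-type draft-mtp \
    --spec-draft-n-min 1 \
    --spec-draft-n-max 2 \
    --spec-draft-p-min 0.0 \
    --reasoning off --jinja
  echo
}

# Show the default release profile.
server_cmd
```

Running sh -c "$(server_cmd 4096 8096)" would start the smaller smoke-test profile. Printing the command first makes it easy to confirm the launch flags before benchmarking.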

License

Use is subject to the base model license and the license terms of the runtime components used to run the GGUF.
