Libraries

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="YTan2000/Qwen3.6-27B-MTP-TQ3_4S",
	filename="Qwen3.6-27B-MTP-TQ3_4S.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
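The call above returns an OpenAI-style completion dictionary rather than a plain string. Below is a minimal sketch of pulling out the reply text, assuming the usual `choices[0]["message"]["content"]` layout. `extract_reply` is a hypothetical helper, and the sample response is hand-written for illustration, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with a role and the generated content.
    content = choices[0]["message"].get("content")
    return (content or "").strip()


# Hand-written stand-in for the dict returned by llm.create_chat_completion(...).
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": " Paris. "}}
    ]
}

print(extract_reply(sample_response))
```

In real use, pass the return value of `llm.create_chat_completion(...)` to the helper instead of `sample_response`.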

llama.cpp

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S
# Run inference directly in the terminal:
llama-cli -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S
# Run inference directly in the terminal:
llama-cli -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S
# Run inference directly in the terminal:
./llama-cli -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Use Docker

docker model run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S

vLLM

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "YTan2000/Qwen3.6-27B-MTP-TQ3_4S"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YTan2000/Qwen3.6-27B-MTP-TQ3_4S",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
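The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above. `build_chat_request` is a hypothetical helper name, and the request is only constructed here, not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "YTan2000/Qwen3.6-27B-MTP-TQ3_4S",
    "What is the capital of France?",
)
print(request.full_url)

# Sending requires the server from the previous step to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```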

Use Docker

docker model run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Ollama
How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Ollama:
```
ollama run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S
```

Unsloth Studio

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for YTan2000/Qwen3.6-27B-MTP-TQ3_4S to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for YTan2000/Qwen3.6-27B-MTP-TQ3_4S to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for YTan2000/Qwen3.6-27B-MTP-TQ3_4S to start chatting

Pi

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "YTan2000/Qwen3.6-27B-MTP-TQ3_4S"
        }
      ]
    }
  }
}
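If the server host or port differs from the default, the provider block above can be generated rather than edited by hand. Below is a small sketch, assuming the JSON layout shown above is all Pi needs. `build_pi_models_config` is a hypothetical helper, not part of Pi.

```python
import json


def build_pi_models_config(base_url: str, model_id: str) -> dict:
    """Return the provider block shown above as a Python dict."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = build_pi_models_config(
    "http://localhost:8080/v1", "YTan2000/Qwen3.6-27B-MTP-TQ3_4S"
)
# Print the JSON; redirect it to ~/.pi/agent/models.json once it looks right.
print(json.dumps(config, indent=2))
```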

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Run Hermes

hermes

Docker Model Runner
How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Docker Model Runner:
```
docker model run hf.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S
```

Lemonade

How to use YTan2000/Qwen3.6-27B-MTP-TQ3_4S with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull YTan2000/Qwen3.6-27B-MTP-TQ3_4S

Run and chat with the model

lemonade run user.Qwen3.6-27B-MTP-TQ3_4S-{{QUANT_TAG}}

List all available models

lemonade list

TurboQwen3.6

Canonical artifact: Qwen3.6-27B-MTP-TQ3_4S

TurboQwen3.6 is the public release name for the TurboQuant GGUF build of the Qwen3.6 27B MTP model line.

The exact file and runtime artifact name remains:

Qwen3.6-27B-MTP-TQ3_4S.gguf

Parent Model

Upstream parent: unsloth/Qwen3.6-27B-MTP-GGUF
Format conversion and TurboQuant packaging: turbo-tan/llama.cpp-tq3

This release is intended for the public TurboQuant runtime fork:

https://github.com/turbo-tan/llama.cpp-tq3

It requires TQ3_4S runtime support and draft-MTP support. It is not expected to run correctly on stock llama.cpp builds that do not contain these extensions.

Matching Projector

The multimodal projector is published separately so the main Hugging Face page stays anchored on the 27B text model:

https://huggingface.co/YTan2000/Qwen3.6-27B-MTP-TQ3_4S-mmproj

Files

Qwen3.6-27B-MTP-TQ3_4S.gguf - main model, 13.39 GiB
mmproj.gguf - matching multimodal projector, 0.87 GiB, hosted in the separate projector repo above
thumbnail.png - model card image
benchmark.png - benchmark summary image
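For planning disk space, the two GGUF sizes listed above can be summed and converted to decimal gigabytes. This is plain arithmetic on the figures in the list. It covers file size only, so the KV cache and runtime buffers need additional memory on top.

```python
GIB = 1024 ** 3  # bytes per GiB

# Sizes in GiB, copied from the file list above.
file_sizes_gib = {
    "Qwen3.6-27B-MTP-TQ3_4S.gguf": 13.39,
    "mmproj.gguf": 0.87,
}

total_gib = sum(file_sizes_gib.values())
total_gb = total_gib * GIB / 1e9  # decimal gigabytes, as many disk tools report

print(f"total: {total_gib:.2f} GiB ({total_gb:.2f} GB)")
```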

Recommended Runtime

Use flash attention at runtime and enable draft-MTP speculative decoding:

./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --mmproj mmproj.gguf \
  --alias Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 32768 -np 1 -ngl 99 -fa on \
  -ctk q8_0 -ctv tq3_0 \
  --spec-type draft-mtp \
  --spec-draft-n-min 1 \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.0 \
  --reasoning off --jinja

Important build note:

The -fa on option in the command above is the runtime flash-attention flag.
It is separate from the CMake build option GGML_CUDA_FA_ALL_QUANTS; do not confuse the two.
The validated fast release path uses runtime -fa on with a build configured with GGML_CUDA_FA_ALL_QUANTS=OFF.

Quick Smoke Test

For a smaller local smoke test, reduce the context to 4096 tokens (this example also moves the server to port 8096):

./build/bin/llama-server \
  -m Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --mmproj mmproj.gguf \
  --alias Qwen3.6-27B-MTP-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8096 \
  -c 4096 -np 1 -ngl 99 -fa on \
  -ctk q8_0 -ctv tq3_0 \
  --spec-type draft-mtp \
  --spec-draft-n-min 1 \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.0 \
  --reasoning off --jinja --no-warmup

Then:

curl -s http://127.0.0.1:8096/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen3.6-27B-MTP-TQ3_4S.gguf","messages":[{"role":"user","content":"Write ONLY the word ok."}],"max_tokens":32,"temperature":0}'

Expected assistant content:

ok
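The same check can be scripted. Below is a sketch using only the Python standard library, assuming the server from the smoke-test command is listening on port 8096. `smoke_passed` and `run_smoke_test` are hypothetical helper names, and the sample responses are hand-written so the snippet runs offline.

```python
import json
import urllib.request

SMOKE_URL = "http://127.0.0.1:8096/v1/chat/completions"
SMOKE_PAYLOAD = {
    "model": "Qwen3.6-27B-MTP-TQ3_4S.gguf",
    "messages": [{"role": "user", "content": "Write ONLY the word ok."}],
    "max_tokens": 32,
    "temperature": 0,
}


def smoke_passed(response: dict) -> bool:
    """True when the first choice's content is exactly 'ok', ignoring whitespace."""
    content = response["choices"][0]["message"].get("content") or ""
    return content.strip() == "ok"


def run_smoke_test(url: str = SMOKE_URL, timeout: float = 120.0) -> bool:
    """POST the smoke prompt to a running server and check the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(SMOKE_PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return smoke_passed(json.load(reply))


# Offline demonstration with hand-written responses (not real server output).
good = {"choices": [{"message": {"role": "assistant", "content": "ok\n"}}]}
bad = {"choices": [{"message": {"role": "assistant", "content": "Sure! ok"}}]}
print(smoke_passed(good), smoke_passed(bad))

# With the server running, call run_smoke_test() to perform the real check.
```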

Benchmark Summary

Local BenchLoop run on an RTX 3090, using draft-MTP speculative decoding and the runtime settings above:

Metric	Result
Overall score	86.28
EasyCode	100.00%
Hard86	88.4%
Toolcall	96.67%
Data extract	90.97%
Instruct follow	76.67%
Reason math	73.33%
Generation speed	44.80 tok/s
Size	13.39 GiB

The packaged benchmark summary image is included in this repo as benchmark.png.

Notes

This is an MTP release. Use --spec-type draft-mtp with --spec-draft-n-max 2.
Use --spec-draft-p-min 0.0 on the current TurboQuant runtime.
Use -ctk q8_0 -ctv tq3_0 for the validated release profile.
If draft acceptance collapses to 0.00000 on long prompts, stop and check the runtime build and launch flags before benchmarking.
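Checking the launch flags can be automated before a benchmark run. The sketch below compares a command line against the flag values named in these notes. `check_launch_flags` is a hypothetical helper, it only understands the space-separated `flag value` form used in this document, and it does not inspect the runtime build itself.

```python
# Flag values taken from the notes above.
REQUIRED_FLAGS = {
    "--spec-type": "draft-mtp",
    "--spec-draft-n-max": "2",
    "--spec-draft-p-min": "0.0",
    "-ctk": "q8_0",
    "-ctv": "tq3_0",
}


def check_launch_flags(argv: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the profile matches."""
    problems = []
    for flag, expected in REQUIRED_FLAGS.items():
        if flag not in argv:
            problems.append(f"missing {flag} {expected}")
            continue
        index = argv.index(flag)
        # The value is the token right after the flag, if there is one.
        actual = argv[index + 1] if index + 1 < len(argv) else None
        if actual != expected:
            problems.append(f"{flag} is {actual!r}, expected {expected!r}")
    return problems


command = (
    "./build/bin/llama-server -m Qwen3.6-27B-MTP-TQ3_4S.gguf "
    "-c 32768 -np 1 -ngl 99 -fa on -ctk q8_0 -ctv tq3_0 "
    "--spec-type draft-mtp --spec-draft-n-min 1 --spec-draft-n-max 2 "
    "--spec-draft-p-min 0.0 --reasoning off --jinja"
)
print(check_launch_flags(command.split()))
```

An empty list means the command matches the validated profile; any entries name the flag to fix before benchmarking.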