Libraries

How to use peerrh/treeflash-qwen3-4b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="peerrh/treeflash-qwen3-4b", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("peerrh/treeflash-qwen3-4b", trust_remote_code=True)
model = AutoModel.from_pretrained("peerrh/treeflash-qwen3-4b", trust_remote_code=True)

vLLM

How to use peerrh/treeflash-qwen3-4b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "peerrh/treeflash-qwen3-4b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "peerrh/treeflash-qwen3-4b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
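
The same request can also be issued from Python. The sketch below uses only the standard library and mirrors the curl payload above; the helper name `build_completion_request` is hypothetical, and the base URL assumes the port used in the curl example.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000", "peerrh/treeflash-qwen3-4b", "Once upon a time,"
)
# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes it possible to inspect the payload before a server is available.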

Use Docker Model Runner

docker model run hf.co/peerrh/treeflash-qwen3-4b

SGLang

How to use peerrh/treeflash-qwen3-4b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "peerrh/treeflash-qwen3-4b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "peerrh/treeflash-qwen3-4b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
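
An OpenAI-compatible completions endpoint returns JSON in which the generated text sits under `choices[i].text`. A minimal sketch of extracting it from a response body follows; `extract_completion_texts` is a hypothetical helper, and the sample body is made up for illustration rather than real model output.

```python
import json


def extract_completion_texts(response_body):
    """Return the generated text of every choice in a completions response."""
    data = json.loads(response_body)
    return [choice["text"] for choice in data.get("choices", [])]


# Illustrative response body (made up, not real model output).
sample = (
    '{"id": "cmpl-0", "object": "text_completion", '
    '"choices": [{"index": 0, "text": " there was a tree.", "finish_reason": "stop"}]}'
)
texts = extract_completion_texts(sample)
```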

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "peerrh/treeflash-qwen3-4b" \
        --host 0.0.0.0 \
        --port 30000

TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

Peer Rheinboldt · Frédéric Berdoz · Roger Wattenhofer

Preprint, submitted June 2026

Quick Start

TreeFlash requires trust_remote_code=True because the drafter architecture and spec_generate method are provided by this repository.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

drafter = AutoModel.from_pretrained(
    "peerrh/treeflash-qwen3-4b",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([text], return_tensors="pt").to(drafter.device)

output_ids = drafter.spec_generate(
    target=target,
    input_ids=inputs["input_ids"],
    max_new_tokens=2048,
    stop_token_ids=[tokenizer.eos_token_id],
    temperature=0.0,
    drafter_temperature=1.0,
    tree_size=64,
    top_m=16,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
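
To check whether speculative decoding pays off on a given setup, one option is to time the same prompt with and without the drafter. The sketch below is a generic timing helper and not part of TreeFlash; `tokens_per_second` is a hypothetical name, and it assumes the timed callable returns a sequence holding the prompt followed by the newly generated token IDs, which should be verified for `spec_generate`.

```python
import time


def tokens_per_second(generate_fn, prompt_length):
    """Time one call of generate_fn and return (new_token_count, tokens_per_second)."""
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    # Assumption: the output contains the prompt plus the generated tokens.
    new_tokens = max(len(output_ids) - prompt_length, 0)
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, rate


def fake_generate():
    """Stand-in for a real generation call, used only to exercise the helper."""
    time.sleep(0.01)
    return list(range(30))


new_tokens, rate = tokens_per_second(fake_generate, prompt_length=10)
```

With the Quick Start objects, the callable could be something like `lambda: drafter.spec_generate(...)[0]` with `prompt_length=inputs["input_ids"].shape[1]`. On a GPU, calling `torch.cuda.synchronize()` before reading the clock gives more reliable timings.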

Supported Models

Target	Drafter
Qwen/Qwen3-4B	peerrh/treeflash-qwen3-4b
Qwen/Qwen3-8B	peerrh/treeflash-qwen3-8b
Qwen/Qwen3-Coder-30B-A3B-Instruct	peerrh/treeflash-qwen3-coder-30b-a3b
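
When scripting across several targets, the table above can be captured as a lookup. The dictionary below copies the pairs verbatim; `DRAFTERS` and `drafter_for` are hypothetical names chosen for illustration.

```python
DRAFTERS = {
    "Qwen/Qwen3-4B": "peerrh/treeflash-qwen3-4b",
    "Qwen/Qwen3-8B": "peerrh/treeflash-qwen3-8b",
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": "peerrh/treeflash-qwen3-coder-30b-a3b",
}


def drafter_for(target_id):
    """Return the drafter repository for a target, matching case-insensitively."""
    lookup = {target.lower(): drafter for target, drafter in DRAFTERS.items()}
    try:
        return lookup[target_id.lower()]
    except KeyError:
        raise ValueError(f"No TreeFlash drafter listed for {target_id!r}") from None
```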

Citation

If you use TreeFlash, please cite:

@article{rheinboldt2026treeflash,
  title={TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding},
  author={Rheinboldt, Peer and Berdoz, Fr{\'e}d{\'e}ric and Wattenhofer, Roger},
  journal={arXiv preprint arXiv:2606.03819},
  year={2026}
}

Model size: 0.7B params (Safetensors, tensor type BF16)
