TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

Peer Rheinboldt · Frédéric Berdoz · Roger Wattenhofer

arXiv

Preprint, submitted June 2026


Quick Start

TreeFlash requires trust_remote_code=True because the drafter architecture and its spec_generate method are defined in this repository's code rather than in the transformers library.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the TreeFlash drafter; its custom architecture ships with the repository.
drafter = AutoModel.from_pretrained(
    "peerrh/treeflash-qwen3-4b",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

# Load the target model whose output the drafter speculates on.
target = AutoModelForCausalLM.from_pretrained(
    "qwen/qwen3-4b",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen3-4b", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]

# Render the chat prompt as plain text, with Qwen3's thinking mode disabled.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([text], return_tensors="pt").to(drafter.device)

# Run speculative decoding: the drafter proposes tokens and the target verifies them.
output_ids = drafter.spec_generate(
    target=target,
    input_ids=inputs["input_ids"],
    max_new_tokens=2048,
    stop_token_ids=[tokenizer.eos_token_id],
    temperature=0.0,
    drafter_temperature=1.0,
    tree_size=64,
    top_m=16,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
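
To give some intuition for what a call with temperature=0.0 and a draft tree has to do, here is a minimal, generic sketch of greedy verification over a token tree. It is not TreeFlash's implementation: the function name longest_accepted_path and the dictionary-based tree layout are hypothetical, and the target's greedy predictions are assumed to be precomputed for every node (in practice this is typically done in a single forward pass). The sketch walks from the root, accepts a drafted child whenever it matches the target's greedy choice, and always emits one final token chosen by the target.

```python
from typing import Dict, List


def longest_accepted_path(
    children: Dict[int, List[int]],  # node id -> ids of its drafted children (0 is the root)
    tokens: Dict[int, int],          # node id -> token drafted at that node
    target_next: Dict[int, int],     # node id -> target's greedy next token after that node
) -> List[int]:
    """Return the tokens emitted by one round of greedy tree verification.

    Hypothetical helper for illustration only; not part of TreeFlash.
    """
    emitted: List[int] = []
    node = 0  # start at the root, i.e. the last already-committed position
    while True:
        want = target_next[node]
        # The target's choice is always emitted: it is either an accepted
        # draft token or the correction/bonus token that ends the round.
        emitted.append(want)
        match = next(
            (c for c in children.get(node, []) if tokens[c] == want), None
        )
        if match is None:
            return emitted
        node = match  # the drafted child was accepted; continue below it


# Toy tree: the root has two drafted children (tokens 10 and 11);
# the child holding token 10 has one drafted child (token 12).
children = {0: [1, 2], 1: [3]}
tokens = {1: 10, 2: 11, 3: 12}
target_next = {0: 10, 1: 12, 2: 99, 3: 7}

print(longest_accepted_path(children, tokens, target_next))  # → [10, 12, 7]
```

In this toy run two drafted tokens are accepted and the target contributes a third, so a single verification step yields three tokens instead of one. A wider tree gives the target more candidate branches to match against at each depth.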

Supported Models


Citation

If you use TreeFlash, please cite:

@article{rheinboldt2026treeflash,
  title={TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding},
  author={Rheinboldt, Peer and Berdoz, Fr{\'e}d{\'e}ric and Wattenhofer, Roger},
  journal={arXiv preprint arXiv:2606.03819},
  year={2026}
}

Model size: 0.7B params · Tensor type: BF16 · Format: Safetensors