DFlare Draft Model for Qwen3-4B
This is the official DFlare draft model checkpoint for Qwen/Qwen3-4B, released alongside the paper:
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
DFlare is a block-diffusion speculative decoding framework that accelerates large language model inference by predicting an entire block of tokens in one shot for the target model to verify in parallel. It removes the narrow conditioning bottleneck of the prior state-of-the-art DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and giving every draft layer a distinct input. Combined with training-data scaling, this enhanced per-layer expressiveness allows the draft model to scale to deeper architectures with consistent gains, achieving 5.52× end-to-end speedup on Qwen3-4B without compromising output quality.
Documentation & code: https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html
Repo: Tencent/AngelSlim
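As a rough illustration of the layer-wise fusion idea described above, the sketch below gives each draft layer its own weight vector over the extracted target hidden states and forms a weighted sum. It is not taken from the AngelSlim code: the softmax normalization, the function names, and the toy dimensions are all assumptions made for illustration, and the actual DFlare fusion module may be parameterized differently.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_target_states(
    target_states: List[Vector], layer_logits: List[Vector]
) -> List[Vector]:
    """Return one fused vector per draft layer.

    target_states: one hidden vector per selected target layer.
    layer_logits:  one row of learnable logits per draft layer
                   (hypothetical parameterization; softmax weighting
                   is an assumption, not the confirmed DFlare design).
    """
    dim = len(target_states[0])
    fused = []
    for logits in layer_logits:
        if len(logits) != len(target_states):
            raise ValueError("need one logit per target layer")
        weights = softmax(logits)
        # Weighted sum of the target hidden states, one dimension at a time.
        fused.append([
            sum(w * state[d] for w, state in zip(weights, target_states))
            for d in range(dim)
        ])
    return fused


# Toy example: 2 target layers, 2 draft layers, hidden size 2.
states = [[1.0, 0.0], [0.0, 1.0]]
logits = [[0.0, 0.0], [100.0, -100.0]]
fused = fuse_target_states(states, logits)
```

In this toy run the first draft layer averages the two target states, while the second draft layer puts almost all of its weight on the first one, so the two draft layers receive different inputs.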
Model Details
| Property | Value |
| --- | --- |
| Target model | Qwen/Qwen3-4B |
| Draft architecture | DFlare (7 layers, hidden_size=2560, attention_heads=32, GQA kv_heads=8) |
| Parameters | ~743 M |
| Block size | 16 |
| Target layers used for fusion | [1, 5, 9, 13, 17, 21, 25, 29, 33] (out of 36) |
| Precision | bfloat16 |
| RoPE | rope_theta = 1,000,000 (no scaling) |
| Vocab size | 151,936 |
| Tied embeddings | yes (tie_word_embeddings = true) |
The draft predicts a block of block_size tokens in parallel, conditioned on (i) target hidden states extracted from the listed target layers and (ii) noise embeddings of the previous block. The target model verifies the block in a single forward pass and accepts the longest matching prefix.
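The "longest matching prefix" rule can be sketched as follows for greedy decoding. This is a generic speculative-decoding illustration, not AngelSlim's implementation: the function name is hypothetical, and it assumes the target returns one greedy token per drafted position plus one extra token after the block, so that every verification step commits at least one token.

```python
from typing import List, Sequence


def accept_longest_prefix(
    draft_block: Sequence[int], target_choices: Sequence[int]
) -> List[int]:
    """Return the tokens committed after verifying one drafted block.

    draft_block:    tokens proposed by the draft model.
    target_choices: the target's greedy token at each drafted position,
                    plus one extra token following the block (assumption).
    """
    if len(target_choices) != len(draft_block) + 1:
        raise ValueError("expected one target token per draft token plus one")

    committed: List[int] = []
    for drafted, verified in zip(draft_block, target_choices):
        if drafted != verified:
            break  # first disagreement ends the accepted prefix
        committed.append(drafted)

    # Append the target's own token at the first mismatch position,
    # or the extra token if the whole block was accepted.
    committed.append(target_choices[len(committed)])
    return committed


partial = accept_longest_prefix([5, 6, 7, 8], [5, 6, 9, 1, 2])
full = accept_longest_prefix([1, 2], [1, 2, 3])
none = accept_longest_prefix([1], [4, 0])
```

Here `partial` keeps the two matching draft tokens and then the target's correction, `full` accepts the entire block plus the extra token, and `none` falls back to a single target token.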
How to Use
This checkpoint is loaded with AngelSlim's QwenDFlareDraftModel class.
1. Install AngelSlim
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim
pip install -e .
2. Run end-to-end speculative decoding benchmark
The repo ships a self-contained benchmark entry that supports both DFlash and DFlare drafts via --draft-arch:
# Single-GPU
python tools/dflash_benchmark.py \
    --model-name-or-path Qwen/Qwen3-4B \
    --draft-name-or-path dflare/qwen3-4b-dflare \
    --draft-arch dflare \
    --dataset gsm8k \
    --max-samples 128 \
    --max-new-tokens 2048 \
    --temperature 0.0

# 8-GPU (workload sharded across ranks, results gathered to rank 0)
torchrun --nproc_per_node=8 --master_port=29600 \
    tools/dflash_benchmark.py \
    --model-name-or-path Qwen/Qwen3-4B \
    --draft-name-or-path dflare/qwen3-4b-dflare \
    --draft-arch dflare \
    --dataset gsm8k \
    --max-samples 128 \
    --max-new-tokens 2048 \
    --temperature 0.0
The script reports:
- Decoding speedup vs. single-token autoregressive decoding
- Average acceptance length per block
- Per-block acceptance-length histogram
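To make the three reported quantities concrete, here is a minimal sketch of how they could be computed from raw measurements. It is not the benchmark script's code: the function name, argument names, and the sample numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_acceptance(
    accept_lengths: Sequence[int],
    baseline_seconds: float,
    speculative_seconds: float,
) -> Dict[str, object]:
    """Summarize one run.

    accept_lengths:      number of tokens accepted in each verified block.
    baseline_seconds:    wall-clock time of single-token autoregressive decoding.
    speculative_seconds: wall-clock time of speculative decoding on the same prompts.
    """
    if not accept_lengths:
        raise ValueError("no blocks recorded")
    if speculative_seconds <= 0:
        raise ValueError("speculative_seconds must be positive")

    return {
        "speedup": baseline_seconds / speculative_seconds,
        "mean_accept_length": sum(accept_lengths) / len(accept_lengths),
        # Histogram keyed by acceptance length, sorted for readable output.
        "histogram": dict(sorted(Counter(accept_lengths).items())),
    }


# Made-up measurements, for illustration only.
summary = summarize_acceptance([4, 4, 7, 1], baseline_seconds=10.0, speculative_seconds=2.5)
```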
⚠️ Do not pass --block-size: the benchmark reads block_size=16 from this checkpoint's config.json, and overriding it will break the train/test alignment.
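If you wrap the benchmark in your own tooling, a small guard like the one below can catch an accidental override before launch. It is a sketch under assumptions: the helper names are hypothetical, and it assumes `block_size` is a top-level key in the checkpoint's `config.json`. The demo writes a stand-in config to a temporary directory rather than reading the real checkpoint.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_block_size(checkpoint_dir: str) -> int:
    """Read block_size from <checkpoint_dir>/config.json (top-level key assumed)."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return int(config["block_size"])


def check_block_size(checkpoint_dir: str, requested: Optional[int] = None) -> int:
    """Return the trained block size, rejecting any conflicting override."""
    trained = read_block_size(checkpoint_dir)
    if requested is not None and requested != trained:
        raise ValueError(
            f"requested block size {requested} differs from trained value {trained}"
        )
    return trained


# Demo against a stand-in config.json, not the real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"block_size": 16}), encoding="utf-8"
    )
    trained = check_block_size(tmp)
    try:
        check_block_size(tmp, requested=8)
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True
```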
Supported datasets out of the box: gsm8k, math500, aime24, aime25, alpaca, mt-bench, humaneval, mbpp, lbpp, swe-bench, livecodebench.
3. Load the checkpoint manually
import torch
from angelslim.compressor.speculative.train.models.draft.qwen_dflare import (
    QwenDFlareDraftModel,
)

draft = QwenDFlareDraftModel.from_pretrained(
    "dflare/qwen3-4b-dflare",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()

print(draft.target_layer_ids)  # [1, 5, 9, 13, 17, 21, 25, 29, 33]
print(draft.block_size)        # 16
print(draft.mask_token_id)     # 151669
Performance
Across six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare on Qwen3-4B delivers a 5.52× average wall-clock speedup over single-token autoregressive decoding, improving over DFlash by roughly 11%. Output quality does not degrade: the target model verifies every block, so under greedy decoding the generated text is identical to what the target model would produce on its own.
For full per-task results, ablations, and acceptance-length distributions, see the official documentation.
Citation
@article{DFlare2026,
title={DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding},
author={Jiebin Zhang and Zhenghan Yu and Song Liu and Eugene J. Yu and Zheng Li and Dawei Zhu and Jiangshan Duo and Weimin Xiong and Yifan Song and Guanghua Yu and Jianchen Zhu and Sujian Li},
journal={arXiv preprint arXiv},
year={2026}
}
License
This checkpoint is released under the Apache 2.0 license, following the AngelSlim project. The target model Qwen/Qwen3-4B retains its own license; consult the target model card before deployment.