DilatedQwen3-0.6B
A Qwen3-0.6B checkpoint repackaged as a custom architecture (model_type: dilated_qwen3) with a non-standard attention pattern. The weights are vanilla
Qwen3-0.6B; only how attention is computed changes.
This is a self-contained HuggingFace bundle: it loads with
trust_remote_code=True and does not depend on any external repo.
Attention mechanism
Standard Qwen3 self-attention is replaced by a local-dense + dilated
long-range causal pattern. Write delta = i - j for the causal distance from
query position i to key position j (delta >= 0). Query i attends to key
j if and only if:
delta < local_window # dense local window: every recent token
OR delta % dilation == 0 # dilated long range: every dilation-th token
So the most recent local_window tokens are attended in full, and everything
older is attended at a stride of dilation, all the way back to the start of
the sequence. Both parts are causal.
Defaults: local_window = 128, dilation = 2. Setting dilation = 1 recovers
standard causal attention, as does any sequence of at most local_window
tokens, since every delta then falls inside the local window.
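The rule above can be sketched in a few lines of plain Python. This is an illustration of the predicate only, not the code in modeling_dilated_qwen3.py; the helper names `attends` and `is_full_causal` are hypothetical and exist only for this example.

```python
def attends(i: int, j: int, local_window: int = 128, dilation: int = 2) -> bool:
    """Return True if query position i may attend to key position j."""
    delta = i - j
    if delta < 0:
        # Future key: never visible, both parts of the pattern are causal.
        return False
    # Dense local window OR every dilation-th older token.
    return delta < local_window or delta % dilation == 0


def is_full_causal(seq_len: int, local_window: int, dilation: int) -> bool:
    """Check whether the pattern equals a plain lower-triangular causal mask."""
    return all(
        attends(i, j, local_window, dilation) == (j <= i)
        for i in range(seq_len)
        for j in range(seq_len)
    )


# dilation = 1 recovers standard causal attention at any length.
assert is_full_causal(64, local_window=6, dilation=1)
# A sequence that fits inside the local window is also fully causal.
assert is_full_causal(6, local_window=6, dilation=2)
# Past the window, odd distances are skipped, so the mask is sparser.
assert not is_full_causal(8, local_window=6, dilation=2)
```

A predicate like this is mainly useful as a slow reference when checking a faster kernel or mask construction against the intended pattern.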
Mask for local_window = 6, dilation = 2 (# = attended, row = query i,
column = key j):
j:   0123456789...
i= 0 #
i= 1 ##
i= 2 ###
i= 3 ####
i= 4 #####
i= 5 ###### <- still inside the local window: dense
i= 6 #######
i= 7 .####### <- past the window: oldest key now skipped (stride 2)
i= 8 #.#######
i= 9 .#.#######
i=10 #.#.#######
i=11 .#.#.#######
The right-hand ####### run is the dense local window (6 keys) plus the first
dilated hit at delta = 6, which lands directly next to it; the #.#. prefix is
the dilated long-range tail.
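The diagram above can be regenerated, and the number of keys each query touches counted, with a short script. This is a sketch under the stated rule only; `render_row` and `keys_attended` are hypothetical helper names, and the closed-form count is derived here from the rule rather than taken from the model code.

```python
def render_row(i: int, local_window: int, dilation: int) -> str:
    """Row i of the mask up to the diagonal: '#' attended, '.' skipped."""
    return "".join(
        "#" if (i - j) < local_window or (i - j) % dilation == 0 else "."
        for j in range(i + 1)
    )


def keys_attended(i: int, local_window: int, dilation: int) -> int:
    """Closed-form count of keys visible to query i."""
    # Dense part: distances 0 .. local_window - 1, clipped at sequence start.
    dense = min(i + 1, local_window)
    # Strided part: multiples of dilation in [local_window, i].
    strided = max(0, i // dilation - (local_window - 1) // dilation)
    return dense + strided


for i in range(12):
    row = render_row(i, local_window=6, dilation=2)
    # The closed form must agree with a direct count of '#' in the row.
    assert row.count("#") == keys_attended(i, 6, 2)
    print(f"i={i:2d} {row}")
```

With the defaults (local_window = 128, dilation = 2), `keys_attended(4095, 128, 2)` gives 2112, against 4096 keys for full causal attention at the same position, so the long-range tail roughly halves the work per query at long context.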
The take-home task
- Register this architecture (custom attention) with vLLM.
- Profile it end-to-end.
- Optimize end-to-end performance.
Loading
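from transformers import AutoModelForCausalLM, AutoTokenizer
# The custom architecture is resolved through auto_map in config.json.
model = AutoModelForCausalLM.from_pretrained(
    "DilatedQwen3-0.6B", trust_remote_code=True
)
# The tokenizer is the stock Qwen3 tokenizer.
tok = AutoTokenizer.from_pretrained("DilatedQwen3-0.6B")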
trust_remote_code=True is required: model_type="dilated_qwen3" is unknown to
transformers, so the architecture must be loaded from the local
modeling_dilated_qwen3.py (and registered explicitly in vLLM).
Files
| File | Purpose |
|---|---|
| configuration_dilated_qwen3.py | Config (local_window, dilation) |
| modeling_dilated_qwen3.py | Model + the local-dense / dilated long-range attention |
| config.json | auto_map → local files |
| model.safetensors | Weights (Qwen3-0.6B, 596M params) |
| tokenizer files | Qwen3 tokenizer |