DilatedQwen3-0.6B

A Qwen3-0.6B checkpoint repackaged as a custom architecture (model_type: dilated_qwen3) with a non-standard attention pattern. The weights are vanilla Qwen3-0.6B; only how attention is computed changes.

This is a self-contained HuggingFace bundle: it loads with trust_remote_code=True and does not depend on any external repo.

Attention mechanism

Standard Qwen3 self-attention is replaced by a local-dense + dilated long-range causal pattern. Write delta = i - j for the causal distance from query position i to key position j (delta >= 0). Query i attends to key j if and only if:

delta < local_window        # dense local window: every recent token
  OR  delta % dilation == 0  # dilated long range: every dilation-th token

So the most recent local_window tokens are attended in full, and everything older is attended at a stride of dilation, all the way back to the start of the sequence. Both parts are causal.
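The rule above can be sketched in plain Python as a predicate plus a masked softmax over one query row. This is an illustrative reference only, not the code in modeling_dilated_qwen3.py; the function names attends and masked_softmax are hypothetical, and it assumes the score list covers keys 0 through i, so the query index is len(scores) - 1.

```python
import math


def attends(i: int, j: int, local_window: int = 128, dilation: int = 2) -> bool:
    """Return True if query position i may attend to key position j."""
    delta = i - j
    if delta < 0:
        # Future key: never visible, the pattern is causal.
        return False
    return delta < local_window or delta % dilation == 0


def masked_softmax(scores, local_window: int = 128, dilation: int = 2):
    """Softmax over one query row; disallowed keys get weight exactly 0.

    Assumption: scores[j] is the raw score for key j, for j = 0..i,
    so the query position is i = len(scores) - 1.
    """
    i = len(scores) - 1
    allowed = [attends(i, j, local_window, dilation) for j in range(i + 1)]
    # Key j = i is always allowed (delta = 0), so this max is never empty.
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Query i = 7 with local_window = 6, dilation = 2 and uniform scores:
# key 0 (delta = 7) is skipped, keys 1..7 share the weight equally.
weights = masked_softmax([0.0] * 8, local_window=6, dilation=2)
```

A production kernel would normally fold the same predicate into a fused attention kernel rather than materialize a mask per row; the sketch is only meant to pin down the semantics.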

Defaults: local_window = 128, dilation = 2. Setting dilation = 1 recovers standard causal attention. Any sequence of at most local_window tokens also gets full causal attention, because every delta is then below local_window.
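For profiling it helps to know how many keys each query actually touches. The sketch below derives a closed-form count from the rule and checks it against a brute-force count. The helper names are hypothetical and nothing here is taken from the model code.

```python
def attended_count(i: int, local_window: int = 128, dilation: int = 2) -> int:
    """Number of keys that query position i attends to, in closed form."""
    if i < local_window:
        # The whole prefix 0..i lies inside the dense window.
        return i + 1
    # Dense window: delta = 0 .. local_window - 1.
    # Dilated tail: multiples of dilation in [local_window, i].
    tail = i // dilation - (local_window - 1) // dilation
    return local_window + tail


def attended_count_brute(i: int, local_window: int = 128, dilation: int = 2) -> int:
    """Same count by direct enumeration of delta = 0 .. i."""
    return sum(
        1
        for delta in range(i + 1)
        if delta < local_window or delta % dilation == 0
    )


# With the defaults, the last query of a 4096-token sequence reads
# 128 window keys plus 1984 dilated keys.
print(attended_count(4095))  # 2112
```

With the defaults, that last query reads 2112 of 4096 keys, about 52%. For long sequences the fraction approaches 1/dilation, which bounds what a sparse kernel can save over dense causal attention.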

Mask for local_window = 6, dilation = 2 (# = attended, row = query i, column = key j):

     j: 0123456789...
 i= 0   #
 i= 1   ##
 i= 2   ###
 i= 3   ####
 i= 4   #####
 i= 5   ######          <- still inside the local window: dense
 i= 6   #######
 i= 7   .#######         <- past the window: oldest key now skipped (stride 2)
 i= 8   #.#######
 i= 9   .#.#######
 i=10   #.#.#######
 i=11   .#.#.#######

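In the rows past the window, the right-hand run of seven # is the six-token dense local window (delta 0 to 5) plus the adjacent dilated hit at delta = 6; the #.#. prefix is the rest of the dilated long-range tail.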
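The diagram can be regenerated directly from the rule, which is a convenient sanity check when porting the pattern to another backend. A minimal sketch follows; render_mask is a hypothetical helper, not part of the bundle.

```python
def render_mask(seq_len: int, local_window: int, dilation: int):
    """Return one string per query row: '#' = attended, '.' = skipped.

    Only the causal part (keys 0..i) of each row is rendered.
    """
    rows = []
    for i in range(seq_len):
        cells = []
        for j in range(i + 1):
            delta = i - j
            visible = delta < local_window or delta % dilation == 0
            cells.append("#" if visible else ".")
        rows.append("".join(cells))
    return rows


# Print the 12 rows in the same layout as the diagram.
for i, row in enumerate(render_mask(12, local_window=6, dilation=2)):
    print(f" i={i:2d}   {row}")
```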

The take-home task

  1. Register this architecture (custom attention) with vLLM.
  2. Profile it end-to-end.
  3. Optimize end-to-end performance.
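For step 1, vLLM exposes ModelRegistry.register_model for out-of-tree architectures. The sketch below shows only the registration call. The architecture name and the module path are assumptions: the real architecture string is whatever config.json lists under architectures, and dilated_qwen3_vllm.model is a hypothetical package that you would write to hold the vLLM implementation.

```python
# Assumption: must match the "architectures" entry in config.json.
ARCH_NAME = "DilatedQwen3ForCausalLM"

# Hypothetical "module:Class" path to your own vLLM model implementation.
# Passing a string lets vLLM import the class lazily.
MODEL_CLS = "dilated_qwen3_vllm.model:DilatedQwen3ForCausalLM"


def register() -> None:
    """Register the custom architecture with vLLM if it is not known yet."""
    # Imported lazily so this module can be imported without vLLM installed.
    from vllm import ModelRegistry

    if ARCH_NAME not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(ARCH_NAME, MODEL_CLS)
```

Call register() before constructing the engine, or expose it as an entry point in the vllm.general_plugins group so that vLLM runs it at startup. The attention pattern itself still has to be implemented inside the registered model class; registration only makes the architecture name resolvable.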

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DilatedQwen3-0.6B", trust_remote_code=True
)
tok = AutoTokenizer.from_pretrained("DilatedQwen3-0.6B")

trust_remote_code=True is required: model_type="dilated_qwen3" is unknown to transformers, so the architecture must be loaded from the local modeling_dilated_qwen3.py (and registered explicitly in vLLM).

Files

| File | Purpose |
| --- | --- |
| configuration_dilated_qwen3.py | Config (local_window, dilation) |
| modeling_dilated_qwen3.py | Model + the local-dense / dilated long-range attention |
| config.json | auto_map → local files |
| model.safetensors | Weights (Qwen3-0.6B, 596M params) |
| tokenizer files | Qwen3 tokenizer |