DilatedQwen3-0.6B

A Qwen3-0.6B checkpoint repackaged as a custom architecture (model_type: dilated_qwen3) with a non-standard attention pattern. The weights are vanilla Qwen3-0.6B; only the attention computation changes.

This is a self-contained Hugging Face bundle: it loads with trust_remote_code=True and does not depend on any external repository.

Attention mechanism

Standard Qwen3 self-attention is replaced by a local-dense + dilated long-range causal pattern. Write delta = i - j for the causal distance from query position i to key position j (delta >= 0). Query i attends to key j if and only if:

delta < local_window        # dense local window: every recent token
  OR  delta % dilation == 0  # dilated long range: every dilation-th token

So the most recent local_window tokens are attended in full, and everything older is attended at a stride of dilation, all the way back to the start of the sequence. Both parts are causal.

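Defaults: local_window = 128, dilation = 2. Setting dilation = 1 recovers standard causal attention, because delta % 1 == 0 for every key. Sequences of at most local_window tokens also reduce to full causal attention, because every causal distance is then below local_window.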
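
The rule above can be written down directly as a predicate. The following is a minimal pure-Python sketch for checking intuition, not the code in modeling_dilated_qwen3.py; the function names attends, build_mask and causal_mask are made up for this illustration.

```python
def attends(i, j, local_window=128, dilation=2):
    """Return True if query position i may attend to key position j."""
    delta = i - j
    if delta < 0:
        # Future key: never visible, both parts of the pattern are causal.
        return False
    # Dense local window OR dilated long-range stride.
    return delta < local_window or delta % dilation == 0


def build_mask(seq_len, local_window=128, dilation=2):
    """Boolean matrix mask[i][j]; True means query i attends to key j."""
    return [
        [attends(i, j, local_window, dilation) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    """Plain lower-triangular causal mask, used here only for comparison."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# dilation = 1 keeps every causal key, so the pattern is standard causal attention.
assert build_mask(16, local_window=4, dilation=1) == causal_mask(16)

# A sequence no longer than local_window never leaves the dense window.
assert build_mask(6, local_window=6, dilation=3) == causal_mask(6)
```

A nested-list mask is only practical for small sequence lengths; it is meant for reasoning about the pattern, not for use inside a model.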

Mask for local_window = 6, dilation = 2 (# = attended, row = query i, column = key j):

     j: 0123456789...
 i= 0   #
 i= 1   ##
 i= 2   ###
 i= 3   ####
 i= 4   #####
 i= 5   ######          <- still inside the local window: dense
 i= 6   #######
 i= 7   .#######         <- past the window: oldest key now skipped (stride 2)
 i= 8   #.#######
 i= 9   .#.#######
 i=10   #.#.#######
 i=11   .#.#.#######

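The right-hand ####### run is the dense local window (six keys, delta 0 to 5) plus the adjacent dilated key at delta = 6, which is attended because 6 is a multiple of the dilation; the #.#. prefix is the dilated long-range tail.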
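
To inspect the pattern for other settings, the diagram can be regenerated with a few lines of Python. This is an illustrative sketch; render_mask is a hypothetical helper and not part of the bundle.

```python
def render_mask(seq_len, local_window, dilation):
    """Return one string per query row: '#' = attended key, '.' = skipped key."""
    rows = []
    for i in range(seq_len):
        cells = []
        for j in range(i + 1):  # causal: only keys j <= i are drawn
            delta = i - j
            visible = delta < local_window or delta % dilation == 0
            cells.append("#" if visible else ".")
        rows.append("".join(cells))
    return rows


# Print the same layout as the diagram above.
for i, row in enumerate(render_mask(12, local_window=6, dilation=2)):
    print(f" i={i:2d}   {row}")
```

Changing dilation to 3 or 4 makes the long-range tail visibly sparser while the dense run on the right stays the same width.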

The take-home task

1. Register this architecture (custom attention) with vLLM.
2. Profile it end-to-end.
3. Optimize end-to-end performance.
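
For the profiling and optimization steps, one useful reference number is how many keys each query actually attends to under the rule from the attention section. The sketch below derives that count in closed form and checks it against brute force; the helper names are hypothetical, and the count says nothing about real kernel speed, which depends on whether the implementation skips masked keys or merely masks them.

```python
def keys_attended(i, local_window=128, dilation=2):
    """Closed-form number of keys that query position i attends to."""
    # Dense part: deltas 0 .. local_window - 1, clipped at the sequence start.
    dense = min(i + 1, local_window)
    # Strided part: multiples of `dilation` in the range [local_window, i].
    strided = max(0, i // dilation - (local_window - 1) // dilation)
    return dense + strided


def keys_attended_bruteforce(i, local_window=128, dilation=2):
    """Same count, by testing every causal key against the attention rule."""
    return sum(
        1
        for j in range(i + 1)
        if (i - j) < local_window or (i - j) % dilation == 0
    )


# The two agree for a range of positions and settings.
for w, d in [(6, 2), (128, 2), (5, 3), (1, 4)]:
    for i in range(600):
        assert keys_attended(i, w, d) == keys_attended_bruteforce(i, w, d)

# Fraction of causal keys kept by the last query of a 4096-token sequence.
i = 4095
print(keys_attended(i) / (i + 1))
```

For long sequences the kept fraction approaches 1 / dilation, so with the default dilation = 2 a kernel that truly skips masked keys would touch roughly half of them.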

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DilatedQwen3-0.6B", trust_remote_code=True
)
tok = AutoTokenizer.from_pretrained("DilatedQwen3-0.6B")

trust_remote_code=True is required: model_type="dilated_qwen3" is unknown to transformers, so the architecture must be loaded from the local modeling_dilated_qwen3.py (and registered explicitly in vLLM).
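
On the vLLM side, explicit registration goes through ModelRegistry.register_model, which takes an architecture name and either a model class or a lazy "module:ClassName" string. The sketch below is an assumption-heavy outline: it assumes config.json carries the usual architectures list, and the module dilated_qwen3_vllm and class DilatedQwen3ForCausalLM are hypothetical names for a vLLM-native port that would have to be written separately, since the Hugging Face modeling file is typically not what vLLM executes.

```python
import json
from pathlib import Path


def read_architecture(bundle_dir):
    """Return the first entry of `architectures` in the bundle's config.json.

    Assumes the config follows the usual Hugging Face layout.
    """
    config = json.loads((Path(bundle_dir) / "config.json").read_text())
    return config["architectures"][0]


def register_with_vllm(arch_name, target):
    """Register `arch_name` with vLLM.

    `target` is a lazy "module:ClassName" reference to a vLLM-native model
    class, so the module is only imported when vLLM needs it.
    """
    # Imported lazily so read_architecture stays usable without vLLM installed.
    from vllm import ModelRegistry

    ModelRegistry.register_model(arch_name, target)


# Example usage (hypothetical module and class names):
#
#   arch = read_architecture("DilatedQwen3-0.6B")
#   register_with_vllm(arch, "dilated_qwen3_vllm:DilatedQwen3ForCausalLM")
```

Registration has to run before the engine is constructed; vLLM also supports doing this from a plugin entry point, which is worth checking in its documentation if the model is served from a separate process.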

Files

| File | Purpose |
| --- | --- |
| `configuration_dilated_qwen3.py` | Config (`local_window`, `dilation`) |
| `modeling_dilated_qwen3.py` | Model + the local-dense / dilated long-range attention |
| `config.json` | `auto_map` → local files |
| `model.safetensors` | Weights (Qwen3-0.6B, 596M params) |
| tokenizer files | Qwen3 tokenizer |
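
To make the auto_map entry concrete, here is a sketch of what the relevant part of config.json might look like, written as a Python dict. The class names DilatedQwen3Config and DilatedQwen3ForCausalLM are assumptions for illustration; only the file names, model_type, and the two config fields come from this card.

```python
# Hypothetical excerpt of config.json, expressed as a Python dict.
config = {
    "model_type": "dilated_qwen3",
    "local_window": 128,
    "dilation": 2,
    "auto_map": {
        # "<module file stem>.<class name>"; class names are assumed here.
        "AutoConfig": "configuration_dilated_qwen3.DilatedQwen3Config",
        "AutoModelForCausalLM": "modeling_dilated_qwen3.DilatedQwen3ForCausalLM",
    },
}


def auto_map_files(cfg):
    """Map each Auto class to the local .py file its auto_map entry points at."""
    return {
        auto_cls: ref.rsplit(".", 1)[0] + ".py"
        for auto_cls, ref in cfg["auto_map"].items()
    }


print(auto_map_files(config))
```

With trust_remote_code=True, transformers follows these entries to import the classes from the files shipped in the bundle instead of looking them up in its built-in registry.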

Safetensors: model size 0.8B params, tensor type BF16
