Dimension mismatch crashes the engine at startup

#18

by Shaulsh - opened about 16 hours ago

From the Dynamo/FakeTensor trace, the crash is a single F.linear(x, W) call:

call_function torch.nn.functional.linear(
    x = FakeTensor(size=(1, s70, 4096), bf16),   # input: [..., 4096]
    W = Parameter(size=(3840, 8192), bf16),      # weight: [out=3840, in=8192]
    bias=None
) → RuntimeError: a and b must have same reduction dim, but got [s70, 4096] X [8192, 3840]

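An nn.Linear stores its weight as [out_features, in_features], so this layer expects an 8192-wide input and produces a 3840-wide output (the text hidden size). The tensor handed to it is only 4096 wide. F.linear computes x @ Wᵀ, which requires the last dimension of x to equal in_features; 4096 ≠ 8192, so the call fails. The error message prints the weight already transposed, as [8192, 3840]. (s70 is just the symbolic sequence length from the trace.)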
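
The shape rule can be checked without loading the model. The sketch below is a plain-Python illustration: linear_output_shape is a hypothetical helper written for this post, not a PyTorch function, and the string "s70" only stands in for the symbolic sequence length.

```python
def linear_output_shape(x_shape, weight_shape):
    """Return the shape of x @ W.T, or raise if the reduction dims differ.

    Hypothetical helper for illustration only; it mirrors the shape rule of
    torch.nn.functional.linear but is not part of PyTorch.
    """
    out_features, in_features = weight_shape  # nn.Linear stores [out, in]
    if x_shape[-1] != in_features:
        raise ValueError(
            f"input width {x_shape[-1]} != weight in_features {in_features}"
        )
    # Leading dims pass through unchanged; only the last dim is replaced.
    return (*x_shape[:-1], out_features)


# Shapes taken from the trace: this combination is the one that fails.
try:
    linear_output_shape((1, "s70", 4096), (3840, 8192))
    message = None
except ValueError as err:
    message = str(err)
print(message)

# An 8192-wide input would pass and yield the 3840-wide text hidden size.
ok_shape = linear_output_shape((1, "s70", 8192), (3840, 8192))
print(ok_shape)
```

The first call reproduces the mismatch from the trace; the second shows that the same weight accepts an 8192-wide input.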

Where (the call stack)

vllm/model_executor/models/transformers/multimodal.py:348 forward
vllm/model_executor/models/transformers/base.py:653 forward
transformers/models/gemma4_unified/modeling_gemma4_unified.py:1121 forward ← the linear is here
This happens inside EngineCore._initialize_kv_caches → determine_available_memory — vLLM's startup dummy forward (it runs the model once under torch.compile/FakeTensors to size the KV cache), so it dies before any real inference.

What's actually wrong
The two key facts:

1. The stack goes through vllm/model_executor/models/transformers/…, which means vLLM has no native gemma4_unified implementation: it is running HF's modeling_gemma4_unified.py through its generic "Transformers backend" wrapper.
2. 8192 = 2 × 4096. That Linear expects a concatenated/fused input (two 4096-wide streams → 8192), which is what the encoder-free "unified" Gemma-4 does internally: it projects raw audio/vision wave features into the text space and fuses them. vLLM's generic Transformers backend feeds it only a single 4096-wide stream, so it does not reproduce the unified model's fusion/concat step.
So the root cause is a vLLM↔Transformers integration gap for the gemma4_unified architecture: the generic backend mis‑drives the unified model's fused projection, feeding 4096 where the layer wants 8192.
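
One way to check the "8192 = 2 × 4096" reading is to scan the checkpoint's 2-D weights for layers whose in_features is a whole multiple of the stream width. The following is a minimal sketch under assumptions: find_fused_inputs is a hypothetical helper, the layer names are made up, and only the (3840, 8192) shape comes from the trace.

```python
def find_fused_inputs(weight_shapes, stream_width):
    """List (name, factor) for layers whose in_features is a whole
    multiple (greater than 1) of stream_width."""
    hits = []
    for name, (out_features, in_features) in weight_shapes.items():
        factor, remainder = divmod(in_features, stream_width)
        if remainder == 0 and factor > 1:
            hits.append((name, factor))
    return hits


# Hypothetical layer names; only the (3840, 8192) shape is from the trace.
shapes = {
    "fused_projection.weight": (3840, 8192),
    "some_other_layer.weight": (3840, 4096),
}
candidates = find_fused_inputs(shapes, stream_width=4096)
print(candidates)
```

With a loaded model, the shapes dict could be built from named_parameters(), keeping only 2-D tensors. A factor of 2 would be consistent with a layer that expects two concatenated 4096-wide streams, though it does not prove that is how the model uses it.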
