# Qwen3-VL-Embedding-2B – ONNX
ONNX export of Qwen/Qwen3-VL-Embedding-2B, split into a separate vision encoder and transformer decoder so each component can be consumed independently by ONNX Runtime or built into a TensorRT engine.
The image resolution and temporal size are baked into the ONNX graph (see Configuration); only `seq_len` on the transformer input is dynamic.
## Repository contents
| File / Dir | Purpose |
|---|---|
| `Vision.onnx` (+ `.onnx.data`) | Vision encoder. Fixed input resolution and temporal size baked in. |
| `Transformer.onnx` (+ `.onnx.data`) | Transformer decoder layers, embedding mode (no KV cache), dynamic `seq_len`. |
| `rotary_params.npz` | mRoPE parameters (`inv_freq`, `mrope_section`), token embedding weights, and image/grid config (`image_height`/`width`, `height_factor`/`width_factor`, `patch_size`, `merge_size`, `hidden_size`, `head_dim`, …). Required alongside the ONNX files at inference time. |
| `tokenizer/` | HF `Qwen3VLProcessor` / tokenizer files for text + image preprocessing. |
| `export_script/` | Scripts used to produce the ONNX files (see Reproducing the export). |
| `text_prompt_APOv2.1_BF16/` | Sample text prompts (fire / smoke detection) used for downstream evaluation; not required for loading the model. |
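To get a feel for what `rotary_params.npz` carries, the sketch below builds an in-memory stand-in for the archive and derives a rotary cos/sin table from `inv_freq`. The concrete values (`head_dim = 128`, rotary base `1e6`, `mrope_section = [16, 24, 24]`) are illustrative assumptions, not guaranteed to match this export — read the real arrays from the shipped file.

```python
import io
import numpy as np

# Simulated rotary_params.npz; at inference time use np.load("rotary_params.npz").
# head_dim, the rotary base, and mrope_section are assumed values for illustration.
head_dim = 128
buf = io.BytesIO()
np.savez(buf,
         inv_freq=1.0 / (1e6 ** (np.arange(0, head_dim, 2) / head_dim)),
         mrope_section=np.array([16, 24, 24]))
buf.seek(0)
params = np.load(buf)

# Rotary angle table: one row per position, one column per frequency.
positions = np.arange(32)                          # toy sequence length
angles = np.outer(positions, params["inv_freq"])   # (32, head_dim // 2)
cos, sin = np.cos(angles), np.sin(angles)

print(cos.shape)  # (32, 64)
```

The `mrope_section` entries split those frequency columns across the temporal/height/width position axes; note they sum to `head_dim // 2`.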
## Configuration
The export was produced with:
| Variable | Value | Notes |
|---|---|---|
| `IMG_SIZE` | `(768, 768)` | Fixed input resolution; must be a multiple of `patch_size × merge_size`. |
| `TEMPORAL_SIZE` | `1` | Frames per clip. |
Changing either of these requires re-running the full export pipeline.
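The divisibility constraint and the resulting visual token count can be checked with a few lines of arithmetic. The `patch_size = 16` and `merge_size = 2` values below are assumptions typical of Qwen-VL-family models; confirm them against `rotary_params.npz` before relying on the numbers.

```python
# Visual token count for the fixed 768x768 export.
# patch_size=16 and merge_size=2 are assumed; read the real values
# from rotary_params.npz.
img_h, img_w = 768, 768
patch_size, merge_size = 16, 2

# Each side must be a multiple of patch_size * merge_size (here, 32).
assert img_h % (patch_size * merge_size) == 0
assert img_w % (patch_size * merge_size) == 0

grid_h, grid_w = img_h // patch_size, img_w // patch_size  # 48 x 48 patches
num_visual_tokens = (grid_h // merge_size) * (grid_w // merge_size)
print(num_visual_tokens)  # 576 tokens after 2x2 patch merging
```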
## Usage
Load the ONNX files with ONNX Runtime and apply the mRoPE / token-embedding parameters from `rotary_params.npz` around the transformer. High-level flow:

1. Preprocess inputs with the tokenizer / processor in `tokenizer/`.
2. Run image pixels through `Vision.onnx` to obtain visual tokens.
3. Look up text token embeddings using the weights saved in `rotary_params.npz`.
4. Concatenate visual + text embeddings into the transformer input sequence.
5. Run the sequence through `Transformer.onnx` with mRoPE parameters from `rotary_params.npz`.
6. The final pooled hidden state is the multimodal embedding.
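The embedding lookup and concatenation steps of that flow can be sketched in NumPy. The arrays below are toy stand-ins: `visual_tokens` would really come from running `Vision.onnx` under ONNX Runtime, and `embed_table` from `rotary_params.npz` (the real vocabulary is far larger, and `hidden_size = 2048` is an assumption for the 2B model).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real artifacts (shapes are illustrative):
# - visual_tokens would be the Vision.onnx output under onnxruntime
# - embed_table would be loaded from rotary_params.npz
hidden_size, toy_vocab = 2048, 1000
visual_tokens = rng.standard_normal((576, hidden_size)).astype(np.float32)
embed_table = rng.standard_normal((toy_vocab, hidden_size)).astype(np.float32)

def build_input_sequence(text_ids, visual_tokens, embed_table):
    # Steps 3-4: embed the text token ids, then append them to the visual tokens.
    text_embeds = embed_table[np.asarray(text_ids)]      # (T, hidden)
    return np.concatenate([visual_tokens, text_embeds])  # (576 + T, hidden)

seq = build_input_sequence([101, 202, 303], visual_tokens, embed_table)
print(seq.shape)  # (579, 2048)
```

The resulting sequence (plus the mRoPE position ids) is what `Transformer.onnx` consumes in step 5.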
## TensorRT
These ONNX files are designed to be converted to TensorRT engines. Notes for the TRT build:
- FP16 is safe globally, except for normalization-sensitive layers; force those to FP32 to avoid overflow in the RMSNorm `x * rsqrt(square(x).sum())` pattern.
- Engines are not portable across GPU architectures or TRT versions and must be built on the target machine.
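A quick NumPy illustration of why that reduction needs FP32: the intermediate sum of squares can exceed the FP16 maximum (about 65504) even when the individual activations are modest, at which point `rsqrt` of `inf` collapses the normalization. The vector length and magnitude below are just a plausible example.

```python
import numpy as np

# Hidden vector of modest magnitude; 4096 elements of 8.0 is illustrative.
x = np.full(4096, 8.0, dtype=np.float16)

sum_fp16 = np.square(x).sum(dtype=np.float16)     # FP16 accumulation overflows
sum_fp32 = np.square(x.astype(np.float32)).sum()  # FP32 accumulation is fine

print(sum_fp16)  # inf  -> rsqrt(inf) = 0, RMSNorm output collapses
print(sum_fp32)  # 262144.0
```

This is exactly the failure mode a global FP16 build triggers, which is why the normalization layers are pinned to FP32.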
## Reproducing the export
Scripts in `export_script/` regenerate the ONNX assets from the HF PyTorch checkpoint:

```bash
# Install dependencies
pip install -r export_script/requirements.txt

# Full pipeline
bash export_script/run_all.sh

# Or run the stages individually
python export_script/a_export_to_onnx.py
python export_script/b_export_onnx_vision.py
```
- `a_export_to_onnx.py` – exports `Transformer.onnx` and saves `rotary_params.npz`. Also produces an initial `Vision.onnx` via a manual path (norm fusion, GELU replacement), which gets overwritten in step b.
- `b_export_onnx_vision.py` – re-exports `Vision.onnx` by wrapping the HF `Qwen3VLVisionModel` directly. Traces the exact PyTorch code path, so numerics match HF.
- `qwen3_vl_embedding.py` – model wrapper used by the export scripts.
The TensorRT build step is not included in this repo; it depends on your local GPU / TRT installation.
## Known limitations
The export pipeline currently only produces reliable engines for `TEMPORAL_SIZE` ≤ 2. Beyond that, the Torch vs TRT cosine similarity drops below 0.99 and parity can no longer be guaranteed. If you need longer temporal contexts, the export path will need further investigation (likely around the temporal patching / rotary handling).
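The parity metric referenced above is a plain cosine similarity over the flattened outputs; a minimal version of such a check, with toy data standing in for the Torch and TRT outputs, might look like this.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity over flattened arrays, as used for parity checks."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: a small perturbation (standing in for FP16/TRT noise)
# keeps the similarity comfortably above the 0.99 parity threshold.
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)       # e.g. Torch output
test = ref + 1e-3 * rng.standard_normal(2048)  # e.g. TRT engine output
print(cosine_similarity(ref, test) > 0.99)  # True
```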
## License
Apache-2.0, inherited from the base model. See Qwen/Qwen3-VL-Embedding-2B for upstream terms.