Qwen3-VL-Embedding-2B (ONNX)

ONNX export of Qwen/Qwen3-VL-Embedding-2B, split into a separate vision encoder and transformer decoder so each component can be consumed independently by ONNX Runtime or built into a TensorRT engine.

The image resolution and temporal size are baked into the ONNX graph (see Configuration). Only seq_len on the transformer is dynamic.

Repository contents

| File / Dir | Purpose |
| --- | --- |
| Vision.onnx (+ .onnx.data) | Vision encoder. Fixed input resolution and temporal size baked in. |
| Transformer.onnx (+ .onnx.data) | Transformer decoder layers, embedding mode (no KV cache), dynamic seq_len. |
| rotary_params.npz | mRoPE parameters (inv_freq, mrope_section), token embedding weights, and image/grid config (image_height/width, height_factor/width_factor, patch_size, merge_size, hidden_size, head_dim, …). Required alongside the ONNX files at inference time. |
| tokenizer/ | HF Qwen3VLProcessor / tokenizer files for text + image preprocessing. |
| export_script/ | Scripts used to produce the ONNX files (see Reproducing the export). |
| text_prompt_APOv2.1_BF16/ | Sample text prompts (fire / smoke detection) used for downstream evaluation; not required for loading the model. |
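
To see exactly which arrays a given export saved into rotary_params.npz, the archive can be listed with NumPy. This is a minimal sketch; the key names it prints are whatever the export actually wrote, not a fixed contract:

```python
# List every array stored in rotary_params.npz with its shape and dtype.
import numpy as np

params = np.load("rotary_params.npz")
for key in params.files:
    print(key, params[key].shape, params[key].dtype)
```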

Configuration

The export was produced with:

| Variable | Value | Notes |
| --- | --- | --- |
| IMG_SIZE | (768, 768) | Fixed input resolution; must be a multiple of patch_size × merge_size. |
| TEMPORAL_SIZE | 1 | Frames per clip. |

Changing either of these requires re-running the full export pipeline.
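
As a quick sanity check before changing IMG_SIZE, the patch/merge parameters saved in rotary_params.npz can be used to verify the multiple-of constraint. The patch_size / merge_size key names below are assumed from the repository-contents table:

```python
# Verify that a candidate resolution is a multiple of patch_size * merge_size.
import numpy as np

params = np.load("rotary_params.npz")
factor = int(params["patch_size"]) * int(params["merge_size"])  # assumed key names
for side in (768, 768):  # candidate (height, width)
    assert side % factor == 0, f"{side} is not a multiple of {factor}"
```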

Usage

Load the ONNX files with ONNX Runtime and apply the mRoPE / token-embedding parameters from rotary_params.npz around the transformer. High-level flow:

  1. Preprocess inputs with the tokenizer / processor in tokenizer/.
  2. Run image pixels through Vision.onnx to obtain visual tokens.
  3. Look up text token embeddings using the weights saved in rotary_params.npz.
  4. Concatenate visual + text embeddings into the transformer input sequence.
  5. Run the sequence through Transformer.onnx with mRoPE parameters from rotary_params.npz.
  6. The final pooled hidden state is the multimodal embedding.
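
A minimal ONNX Runtime sketch of this flow is shown below. The graph input/output names (pixel_values, inputs_embeds), the embed_tokens key in rotary_params.npz, the plain visual-then-text concatenation, and last-token pooling are all assumptions; check the real names with session.get_inputs() / get_outputs() and np.load(...).files before relying on it.

```python
# Hedged sketch of steps 1-6 above with ONNX Runtime; names marked "assumed"
# must be checked against the actual exported graphs and npz contents.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

params = np.load("rotary_params.npz")
embed_table = params["embed_tokens"]                 # assumed key for the token-embedding matrix

processor = AutoProcessor.from_pretrained("tokenizer")
vision = ort.InferenceSession("Vision.onnx")         # Vision.onnx.data is picked up automatically
transformer = ort.InferenceSession("Transformer.onnx")

# 1. Preprocess text + image with the bundled processor
inputs = processor(text=["wildfire smoke"], images=[Image.open("frame.jpg")], return_tensors="np")

# 2. Visual tokens from the vision encoder ("pixel_values" is an assumed input name)
(visual_tokens,) = vision.run(None, {"pixel_values": inputs["pixel_values"].astype(np.float32)})

# 3. Text token embeddings looked up from the exported table
text_embeds = embed_table[inputs["input_ids"][0]]

# 4. One input sequence: visual tokens followed by text embeddings
visual_tokens = visual_tokens.reshape(-1, text_embeds.shape[-1])
seq = np.concatenate([visual_tokens, text_embeds], axis=0)[None, ...]

# 5. Transformer pass; depending on the export, mRoPE position inputs built from
#    params["inv_freq"] / params["mrope_section"] may also have to be fed here.
(hidden,) = transformer.run(None, {"inputs_embeds": seq.astype(np.float32)})

# 6. Pool (last-token pooling assumed) and L2-normalize to get the multimodal embedding
embedding = hidden[0, -1]
embedding = embedding / np.linalg.norm(embedding)
```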

TensorRT

These ONNX files are designed to be converted to TensorRT engines. Notes for the TRT build:

  • FP16 is safe globally, except for normalization-sensitive layers: force those to FP32 to avoid overflow in the RMSNorm x * rsqrt(square(x).sum()) pattern (see the builder sketch after this list).
  • Engines are not portable across GPU architectures or TRT versions and must be built on the target machine.
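
A sketch of that precision handling with the TensorRT Python API is shown below. The "norm" layer-name filter, the optimization-profile ranges for the dynamic seq_len, and the output file name are assumptions about the exported graph rather than part of this repo.

```python
# Hedged sketch: build an FP16 engine while pinning normalization layers to FP32.
# The "norm" name filter and the seq_len profile ranges are assumptions; inspect
# the parsed network and adjust them for the real exported graph.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("Transformer.onnx"):   # also resolves Transformer.onnx.data
    raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Force layers implementing the RMSNorm reduction to FP32 so the squared sums cannot overflow.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if "norm" in layer.name.lower():
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

# The transformer's seq_len is dynamic, so an optimization profile is required
# (placeholder ranges; the first input is assumed to be the embedding sequence).
inp = network.get_input(0)
hidden = inp.shape[-1]
profile = builder.create_optimization_profile()
profile.set_shape(inp.name, (1, 1, hidden), (1, 1024, hidden), (1, 4096, hidden))
config.add_optimization_profile(profile)

with open("Transformer.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```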

Reproducing the export

Scripts in export_script/ regenerate the ONNX assets from the HF PyTorch checkpoint:

# Install dependencies
pip install -r export_script/requirements.txt

# Full pipeline
bash export_script/run_all.sh

# Or run the stages individually
python export_script/a_export_to_onnx.py
python export_script/b_export_onnx_vision.py

  • a_export_to_onnx.py: Exports Transformer.onnx and saves rotary_params.npz. Also produces an initial Vision.onnx via a manual path (norm fusion, GELU replacement), which is overwritten in step b.
  • b_export_onnx_vision.py: Re-exports Vision.onnx by wrapping the HF Qwen3VLVisionModel directly. It traces the exact PyTorch code path, so the numerics match HF.
  • qwen3_vl_embedding.py: Model wrapper used by the export scripts.

The TensorRT build step is not included in this repo; it depends on your local GPU / TRT installation.

Known limitations

The export pipeline currently only produces reliable engines for TEMPORAL_SIZE ≤ 2. Beyond that, the Torch vs TRT cosine similarity drops below 0.99 and parity can no longer be guaranteed. If you need longer temporal contexts, the export path will need further investigation (likely around the temporal patching / rotary handling).
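
The parity metric referenced above is plain cosine similarity between a PyTorch reference embedding and the ONNX/TRT embedding; a small helper for reproducing that check (the 0.99 threshold is the one quoted here):

```python
# Cosine similarity between a reference (PyTorch) embedding and an ONNX/TRT embedding.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parity is treated as acceptable when the value stays at or above 0.99.
```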

License

Apache-2.0, inherited from the base model. See Qwen/Qwen3-VL-Embedding-2B for upstream terms.
