Cosmos-Reason2-2B ONNX (portable, for FP8 TensorRT engine build)

Portable ONNX export of nvidia/Cosmos-Reason2-2B (Qwen3-VL-2B VLM), ready for building FP8 TensorRT engines on NVIDIA Jetson AGX Thor (SM 11.0) and other SM 9.0+ GPUs.

Contents

  • llm.onnx (+ llm.onnx.data and weight shards) - Qwen3VL text decoder (bf16, opset 18, eager attention)
  • visual_enc_onnx/visual_encoder.onnx - Qwen3VL vision encoder (bf16, opset 17)
  • config.json, generation_config.json, tokenizer.*, chat_template.json, preprocessor_config.json, video_preprocessor_config.json - from source model

FP8 quantization

Weights are exported in bfloat16. FP8 quantization is applied at TensorRT engine build time on the target device, using TensorRT native FP8 calibration. This is the recommended path for FP8-capable GPUs such as H100 (SM 9.0), L40S (SM 8.9), and Jetson AGX Thor (SM 11.0, Blackwell), because TensorRT can choose per-layer FP8 scales for the specific hardware.
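To make "FP8 scale" concrete, here is a minimal pure-Python sketch of per-tensor FP8 (E4M3) fake quantization. It only illustrates the number format and the role of the scale; it is not TensorRT's calibration algorithm, and the helper names (round_to_e4m3, fake_quantize) are hypothetical.

```python
import math

E4M3_MAX = 448.0              # largest finite FP8 E4M3 magnitude
E4M3_MIN_QUANTUM = 2.0 ** -9  # spacing between E4M3 subnormals


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value, saturating at +/-448."""
    v = max(-E4M3_MAX, min(E4M3_MAX, v))
    if v == 0.0:
        return 0.0
    _, exp = math.frexp(abs(v))  # abs(v) = m * 2**exp with 0.5 <= m < 1
    # E4M3 keeps 4 significant bits (1 implicit + 3 stored mantissa bits).
    quantum = max(2.0 ** (exp - 4), E4M3_MIN_QUANTUM)
    return round(v / quantum) * quantum


def fake_quantize(values, amax=None):
    """Quantize then dequantize values with a single per-tensor scale.

    amax is the calibrated absolute maximum; by default it is taken
    from the data itself (an assumption made for this illustration).
    """
    if amax is None:
        amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return scale, [round_to_e4m3(v / scale) * scale for v in values]


if __name__ == "__main__":
    scale, deq = fake_quantize([0.3, -2.0, 4.0])
    print("scale:", scale)
    print("dequantized:", deq)
```

The scale maps the largest observed magnitude onto 448, so values far below that maximum lose relative precision; this is why per-layer scales chosen during calibration matter.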

For prebuilt Jetson Thor engines (SM 11.0): see companion repo cagataydev/cosmos-reason2-2b-fp8-trt-thor-sm110.

Build engines on Jetson Thor

# 1. Download this repo and create the engine output directories
hf download cagataydev/cosmos-reason2-2b-fp8-onnx --local-dir ./cosmos-onnx
mkdir -p engines visual_engines

# 2. Build LLM engine (FP8)
trtexec \
  --onnx=./cosmos-onnx/llm.onnx \
  --fp8 --bf16 \
  --saveEngine=engines/cosmos-reason2-2b-fp8-llm.engine \
  --minShapes=input_ids:1x1,attention_mask:1x1,position_ids:3x1x1 \
  --optShapes=input_ids:1x512,attention_mask:1x512,position_ids:3x1x512 \
  --maxShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:3x1x1024

# 3. Build Vision engine (FP8)
trtexec \
  --onnx=./cosmos-onnx/visual_enc_onnx/visual_encoder.onnx \
  --fp8 --bf16 \
  --saveEngine=visual_engines/cosmos-reason2-2b-fp8-visual.engine \
  --minShapes=pixel_values:4x1176,grid_thw:1x3 \
  --optShapes=pixel_values:1024x1176,grid_thw:1x3 \
  --maxShapes=pixel_values:10240x1176,grid_thw:8x3

Or use the IntBot TensorRT-Edge-LLM builders (llm_build, visual_build).
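When changing the sequence-length range, the three LLM inputs must stay in sync in every shape profile. The sketch below generates the trtexec shape flags from three lengths and checks whether a batch of images fits the vision profile shown above. The helper names are hypothetical. It assumes position_ids is laid out as 3 x batch x sequence and that pixel_values has one row per patch (the sum of t*h*w over the rows of grid_thw), as in the Qwen-VL processors.

```python
def llm_shape_flags(min_len: int, opt_len: int, max_len: int, batch: int = 1) -> list[str]:
    """Build trtexec --minShapes/--optShapes/--maxShapes flags for the LLM."""
    if not (1 <= min_len <= opt_len <= max_len):
        raise ValueError("expected 1 <= min_len <= opt_len <= max_len")

    def profile(seq: int) -> str:
        # All three inputs share the same batch and sequence dimensions.
        return ",".join([
            f"input_ids:{batch}x{seq}",
            f"attention_mask:{batch}x{seq}",
            f"position_ids:3x{batch}x{seq}",
        ])

    return [
        f"--minShapes={profile(min_len)}",
        f"--optShapes={profile(opt_len)}",
        f"--maxShapes={profile(max_len)}",
    ]


def fits_vision_profile(grid_thw, min_rows=4, max_rows=10240, max_items=8) -> bool:
    """Check (t, h, w) patch grids against the vision engine profile.

    Defaults mirror the min/max shapes used in the vision build command.
    """
    rows = sum(t * h * w for t, h, w in grid_thw)  # assumed pixel_values rows
    return 1 <= len(grid_thw) <= max_items and min_rows <= rows <= max_rows


if __name__ == "__main__":
    print(" \\\n  ".join(llm_shape_flags(1, 512, 1024)))
    print(fits_vision_profile([(1, 32, 32)]))
```

With lengths 1, 512, and 1024 the generated flags match the LLM build command above.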

Export notes

The Qwen3VL text decoder could not be exported with the standard HuggingFace-wrapped forward() because of two upstream issues:

  1. The @check_model_inputs decorator triggers a _Map_base::at / unordered_map::at error inside torch._functorch.autograd_function.custom_function_call_vmap_generate_rule (an interaction between torch 2.6, transformers 4.57.6, and create_causal_mask).
  2. SDPA with GQA + position_ids cannot be converted to ONNX (the exporter reports: scaled_dot_product_attention not implemented if enable_gqa is True).

Our workaround (see export_v4.py):

  • Load with attn_implementation="eager"
  • Re-implement the text decoder forward inline, bypassing the decorator chain
  • Construct a plain causal mask manually (no functorch custom autograd fn)
  • Use the dynamo=True export path with external_data=True so the weights are written as external, shardable data
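The "plain causal mask" in the workaround can be illustrated in pure Python. This is a sketch of the mask semantics only, not the code in export_v4.py: the function name is hypothetical, and a real export would build the same pattern with tensor operations (for example torch.triu) in the model's dtype.

```python
NEG_INF = float("-inf")


def causal_mask(attention_mask, fill=NEG_INF):
    """Additive attention mask for one sequence.

    Entry [i][j] is 0.0 when query position i may attend to key position j
    (j is not in the future and is not padding) and `fill` otherwise.
    attention_mask holds 1 for real tokens and 0 for padding.
    """
    n = len(attention_mask)
    return [
        [0.0 if j <= i and attention_mask[j] else fill for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    for row in causal_mask([1, 1, 1, 0]):
        print(row)
```

The mask is added to the attention scores before the softmax. A row that is masked everywhere (for example a left-padded query position) produces NaN with -inf, so passing a large finite negative value as fill is a common alternative.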

Provenance

  • Hardware: AWS EC2, NVIDIA L40S, Ubuntu 24.04
  • torch==2.6.0+cu124, transformers==4.57.6, nvidia-modelopt==0.43.0
  • torch.onnx.export(dynamo=True, opset_version=18, external_data=True)
  • Produced by DevDuck auto-pipeline, 2026-05-07

Limitations

  • Weights are in bf16; end-to-end FP8 requires a TensorRT build step.
  • The LLM ONNX is a single-pass forward (no KV cache); to add KV-cache support for streaming generation, re-export with past_key_values inputs/outputs.
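
The cost of the missing KV cache can be seen in a minimal greedy decoding loop. The sketch below uses a hypothetical forward callable that stands in for one run of the LLM engine (token ids in, next-token logits out); it is not a TensorRT API. Every new token reprocesses the whole sequence, and the total length is capped by the 1024-token max shape used in the build command.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for one engine run: token ids -> next-token logits.
Forward = Callable[[Sequence[int]], Sequence[float]]


def greedy_decode(
    forward: Forward,
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
    max_len: int = 1024,
) -> Tuple[List[int], int]:
    """Greedy generation without a KV cache.

    Returns the full token sequence and the total number of token
    positions processed across all forward passes.
    """
    ids = list(prompt_ids)
    tokens_processed = 0
    for _ in range(max_new_tokens):
        if len(ids) >= max_len:  # keep the total sequence within max_len
            break
        logits = forward(ids)    # full recompute over every token so far
        tokens_processed += len(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids, tokens_processed


if __name__ == "__main__":
    # Toy model over a 5-token vocabulary: always predicts (last token + 1) % 5.
    def toy(ids):
        return [1.0 if t == (ids[-1] + 1) % 5 else 0.0 for t in range(5)]

    print(greedy_decode(toy, [0], max_new_tokens=10, eos_id=3))
```

For a prompt of P tokens and N generated tokens, the loop processes roughly N*P + N*(N-1)/2 token positions, which is the quadratic cost that a re-export with past_key_values would avoid.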