Cosmos-Reason2-2B ONNX (portable, for FP8 TensorRT engine build)
Portable ONNX export of nvidia/Cosmos-Reason2-2B (Qwen3-VL-2B VLM), ready for building FP8 TensorRT engines on NVIDIA Jetson AGX Thor (SM 11.0) and other FP8-capable GPUs (SM 8.9 and newer).
Contents
- `llm.onnx` (+ `llm.onnx.data` and weight shards) - Qwen3VL text decoder (bf16, opset 18, eager attention)
- `visual_enc_onnx/visual_encoder.onnx` - Qwen3VL vision encoder (bf16, opset 17)
- `config.json`, `generation_config.json`, `tokenizer.*`, `chat_template.json`, `preprocessor_config.json`, `video_preprocessor_config.json` - from source model
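After downloading, it can help to confirm that the files listed above are present before starting a long engine build. The following is a minimal sketch, not part of this repo: `missing_files` is a hypothetical helper, and `tokenizer.*` is kept as a glob because the exact tokenizer file names are not listed here.

```python
from pathlib import Path

# Paths taken from the Contents list; "tokenizer.*" stays a glob because
# the exact tokenizer file names are not spelled out in this README.
REQUIRED_FILES = [
    "llm.onnx",
    "llm.onnx.data",
    "visual_enc_onnx/visual_encoder.onnx",
    "config.json",
    "generation_config.json",
    "chat_template.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
]
REQUIRED_GLOBS = ["tokenizer.*"]


def missing_files(repo_dir):
    """Return the expected entries that are absent from repo_dir."""
    root = Path(repo_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [pat for pat in REQUIRED_GLOBS if not any(root.glob(pat))]
    return missing
```

For example, `missing_files("./cosmos-onnx")` returns an empty list when the download is complete.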
FP8 quantization
Weights are exported in bfloat16. FP8 quantization is applied at TensorRT engine build time on the target device, using TensorRT's native FP8 calibration. This is the recommended path for Ada (SM 8.9, e.g. L40S), Hopper (SM 9.0, e.g. H100), and SM 11.0 (Jetson AGX Thor / Blackwell Jetson), because TensorRT can tune layer-wise FP8 scales for the specific hardware.
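To make "FP8 scale" concrete, here is a small pure-Python sketch of per-tensor scaling into the FP8 E4M3 format (3 mantissa bits, largest finite value 448). It only illustrates the idea and is an assumption about the general technique; it is not TensorRT's calibration procedure, and the function names are made up for this example.

```python
import math

E4M3_MAX = 448.0        # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6       # exponent of the smallest normal E4M3 value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # spacing of E4M3 values in this binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize(values):
    """Pick a per-tensor scale, quantize to E4M3, then dequantize."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return scale, [round_to_e4m3(v / scale) * scale for v in values]
```

The scale maps the largest absolute value in the tensor onto 448, so the full E4M3 range is used; a poorly chosen scale either saturates large values or wastes precision on small ones.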
For prebuilt Jetson Thor engines (SM 11.0): see companion repo cagataydev/cosmos-reason2-2b-fp8-trt-thor-sm110.
Build engines on Jetson Thor
# 1. Download this repo and create the engine output directories
hf download cagataydev/cosmos-reason2-2b-fp8-onnx --local-dir ./cosmos-onnx
mkdir -p engines visual_engines
# 2. Build LLM engine (FP8)
trtexec \
--onnx=./cosmos-onnx/llm.onnx \
--fp8 --bf16 \
--saveEngine=engines/cosmos-reason2-2b-fp8-llm.engine \
--minShapes=input_ids:1x1,attention_mask:1x1,position_ids:3x1x1 \
--optShapes=input_ids:1x512,attention_mask:1x512,position_ids:3x1x512 \
--maxShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:3x1x1024
# 3. Build Vision engine (FP8)
trtexec \
--onnx=./cosmos-onnx/visual_enc_onnx/visual_encoder.onnx \
--fp8 --bf16 \
--saveEngine=visual_engines/cosmos-reason2-2b-fp8-visual.engine \
--minShapes=pixel_values:4x1176,grid_thw:1x3 \
--optShapes=pixel_values:1024x1176,grid_thw:1x3 \
--maxShapes=pixel_values:10240x1176,grid_thw:8x3
Or use the IntBot TensorRT-Edge-LLM builders (llm_build, visual_build).
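If you script the build, the `--minShapes` / `--optShapes` / `--maxShapes` strings are easy to mistype. The sketch below generates them from dictionaries and checks that each dimension satisfies min <= opt <= max. `format_shapes` and `profile_flags` are hypothetical helpers for illustration; the shape values are copied from the LLM `trtexec` command above.

```python
def format_shapes(shapes):
    """Render {"input_ids": (1, 512)} as "input_ids:1x512", comma-joined."""
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


def profile_flags(min_shapes, opt_shapes, max_shapes):
    """Check min <= opt <= max per dimension and return the three flags."""
    for name in min_shapes:
        dims = zip(min_shapes[name], opt_shapes[name], max_shapes[name])
        for lo, opt, hi in dims:
            if not lo <= opt <= hi:
                raise ValueError(
                    f"{name}: expected min <= opt <= max, got {lo}, {opt}, {hi}"
                )
    return [
        f"--minShapes={format_shapes(min_shapes)}",
        f"--optShapes={format_shapes(opt_shapes)}",
        f"--maxShapes={format_shapes(max_shapes)}",
    ]


# LLM optimization profile, copied from the trtexec command above.
llm_flags = profile_flags(
    {"input_ids": (1, 1), "attention_mask": (1, 1), "position_ids": (3, 1, 1)},
    {"input_ids": (1, 512), "attention_mask": (1, 512), "position_ids": (3, 1, 512)},
    {"input_ids": (1, 1024), "attention_mask": (1, 1024), "position_ids": (3, 1, 1024)},
)
```

The resulting list can be appended to the rest of the `trtexec` arguments, for example when calling it through `subprocess.run`.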
Export notes
The Qwen3VL text decoder could not be exported with the standard HuggingFace-wrapped `forward()` because of two upstream issues:
- The `@check_model_inputs` decorator triggers `_Map_base::at` / `unordered_map::at` inside `torch._functorch.autograd_function.custom_function_call_vmap_generate_rule` (torch 2.6 + transformers 4.57.6 interaction with `create_causal_mask`)
- SDPA with GQA + position_ids is not convertible to ONNX (`scaled_dot_product_attention not implemented if enable_gqa is True`)
Our workaround (see export_v4.py):
- Load with `attn_implementation="eager"`
- Re-implement the text decoder forward inline, bypassing the decorator chain
- Construct a plain causal mask manually (no functorch custom autograd fn)
- Use the `dynamo=True` path with `external_data=True` for shardable output
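The "plain causal mask" step can be illustrated with a few lines of pure Python. This is a sketch of the general pattern only, not the code in `export_v4.py`: an actual export would presumably build the same lower-triangular pattern as a tensor, and `causal_mask` is a name chosen for this example.

```python
def causal_mask(seq_len, masked_value=float("-inf")):
    """Return a seq_len x seq_len additive attention mask as nested lists.

    Entry [i][j] is 0.0 when query position i may attend to key position j
    (j <= i) and masked_value otherwise. The mask is added to the attention
    scores before the softmax.
    """
    return [
        [0.0 if j <= i else masked_value for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

Because the mask is computed from `seq_len` with plain comparisons, it avoids the custom autograd function path that the first upstream issue runs into.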
Provenance
- Hardware: AWS EC2, NVIDIA L40S, Ubuntu 24.04
- `torch==2.6.0+cu124`, `transformers==4.57.6`, `nvidia-modelopt==0.43.0`
- `torch.onnx.export(dynamo=True, opset_version=18, external_data=True)`
- Produced by DevDuck auto-pipeline, 2026-05-07
Limitations
- Weights are in bf16; end-to-end FP8 requires a TensorRT build step.
- The LLM ONNX is a single-pass forward (no KV cache); to add KV-cache support for streaming generation, re-export with `past_key_values` inputs/outputs.
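To show what the missing KV cache means in practice, here is a sketch of greedy decoding that re-runs the whole sequence on every step. Several things are assumptions: `forward_fn` is a hypothetical callable that wraps the engine and returns the last position's logits as a list of floats, the input names and the 3 x 1 x N `position_ids` layout follow the `trtexec` shapes above, and text-only input is assumed so all three position axes share the same indices.

```python
def text_position_ids(seq_len):
    """3 x 1 x seq_len position ids for text-only input (assumption: all
    three rotary axes use the same 0..seq_len-1 positions)."""
    return [[list(range(seq_len))] for _ in range(3)]


def greedy_decode(forward_fn, prompt_ids, max_new_tokens, eos_id, max_len=1024):
    """Greedy decoding without a KV cache: the full sequence is fed each step.

    max_len defaults to 1024, the largest sequence length in the LLM
    optimization profile above.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        if len(ids) >= max_len:
            break
        logits = forward_fn(
            input_ids=[ids],
            attention_mask=[[1] * len(ids)],
            position_ids=text_position_ids(len(ids)),
        )
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

Each step recomputes attention over every earlier token, so the total work grows roughly quadratically with the number of generated tokens; a re-export with `past_key_values` would let each step process only the newest token.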