Cosmos3-Super — Weight-Only FP8 (NVIDIA ModelOpt)

Weight-only quantization of the Cosmos3OmniTransformer from NVIDIA's nvidia/Cosmos3-Super — the 64B omnimodal Cosmos 3 world model (text-to-image, text-to-video, image-to-video, optional synchronized sound). Produced with NVIDIA TensorRT Model Optimizer (ModelOpt) on a single 96 GB workstation GPU, via a streaming method that never materializes the ~128 GB bf16 model (method scripts included).

Only the transformer is quantized. The VAEs and tokenizers are the original bf16 components, bundled so the repo is self-contained. Loading requires the bundled load_cosmos3_modelopt.py (see How to use).

Variants & measured performance

Measured on an NVIDIA RTX PRO 6000 Blackwell (96 GB), single-frame 1024×1024 render, 50 steps. Loading these repos drop-in performs identically to the in-memory quantization path they were validated against.

Build	Bits (weights)	Repo size	Resident VRAM	s/it (1024² still)
FP8 (this repo)	8-bit (E4M3)	~64 GB	~67 GB (meas.)	~1.2
NVFP4 (sibling)	4-bit (E2M1 + scales)	~36 GB	~43 GB (meas.)	~4.6
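
The repo sizes above follow roughly from bits per weight. The following is a back-of-the-envelope sketch, not a measurement: it assumes all 64B parameters are quantized (in practice some layers stay in bf16, and the bundled VAEs and tokenizers add to the repo), and it assumes NVFP4 stores one 8-bit scale per block of 16 weights. The function name is illustrative.

```python
def footprint_gb(num_params, bits_per_weight, scale_bits=0, block_size=1):
    """Approximate weight storage in decimal GB, including per-block scale overhead."""
    effective_bits = bits_per_weight + scale_bits / block_size
    return num_params * effective_bits / 8 / 1e9

PARAMS = 64e9  # "64B" from the model description

bf16 = footprint_gb(PARAMS, 16)
fp8 = footprint_gb(PARAMS, 8)
# ASSUMPTION: 4-bit payload plus one 8-bit scale per 16-weight block.
nvfp4 = footprint_gb(PARAMS, 4, scale_bits=8, block_size=16)

print(f"bf16 ~{bf16:.0f} GB, FP8 ~{fp8:.0f} GB, NVFP4 ~{nvfp4:.0f} GB")
```

Under these assumptions the estimate lands on 128, 64, and 36 GB, in line with the bf16 size quoted earlier and the two repo sizes in the table. Resident VRAM is higher because the unquantized layers and other pipeline components are also loaded.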

Pick FP8 if it fits. In this serving path it is both higher fidelity and ~4× faster (~1.2 vs ~4.6 s/it): FP8 dequant is a single cheap scale applied to a native float8 tensor, while NVFP4 dequant must unpack two 4-bit values per byte and apply two-level block scales in PyTorch. Pick NVFP4 for footprint; it brings the model into ~48 GB-card territory for stills. Note that both builds dequantize on the fly, so quantization here buys memory, not speed. NVFP4's hardware FP4 tensor-core advantage only materializes in engines with FP4 GEMM kernels (TRT-LLM/vLLM territory), not in diffusers.
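
To make the difference in dequant work concrete, here is a minimal pure-Python sketch of the two paths. It is not the bundled loader's code. The E2M1 value table is the standard 4-bit float encoding; the nibble order (low nibble first), the block size, and the function names are assumptions chosen for illustration.

```python
# Magnitudes encoded by the low 3 bits of an E2M1 code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def dequant_fp8(values, scale):
    """FP8 path: the payload is already a float, so dequant is one multiply."""
    return [v * scale for v in values]

def dequant_nvfp4(packed, block_scales, global_scale, block_size=16):
    """FP4 path: unpack two codes per byte, then apply block and global scales."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # ASSUMPTION: low nibble first
        values.append(decode_e2m1(byte >> 4))
    return [
        v * block_scales[i // block_size] * global_scale
        for i, v in enumerate(values)
    ]

# Two bytes hold four weights; a block size of 2 keeps the example small.
print(dequant_fp8([0.5, -2.0], 4.0))
print(dequant_nvfp4(bytes([0x21, 0xF7]), [2.0, 0.5], 1.0, block_size=2))
```

The FP4 path does a bit-unpack, a table lookup, and two multiplies per weight, where the FP8 path does one multiply. When that runs as ordinary PyTorch ops rather than inside a fused kernel, the extra steps show up directly in step time.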

Layers kept in bf16 (not quantized): embeddings, norms, the reasoner head, in/out projections, time/modality adapters, audio adapter. The 64 transformer blocks' attention + MLP linears (incl. MoE experts) are quantized.

Status

✅ Drop-in loading verified end to end (load → render → performance parity with the in-memory method) on Blackwell (sm_120), via the bundled loader.
✅ modelopt_state.pth is part of the checkpoint and is required — it restores the quantized module structure at load. Do not delete it.
⚠️ The loader (load_cosmos3_modelopt.py) is required, not optional. The current diffusers/accelerate/modelopt combination cannot materialize a pre-quantized ModelOpt checkpoint unaided; the loader applies three small, source-verified workarounds (parameter materialization for packed weights, payload-dtype restoration for FP8, and weight-only quantizer enforcement) plus the validated bf16 dtype normalization. ModelOpt marks this path experimental; expect the loader to become unnecessary as upstream catches up.
❌ vLLM-Omni: not a working path as of 0.22.0. This is an upstream roadmap gap, not a defect of this checkpoint: vLLM-Omni's ModelOpt integration is currently wired for LLMs only, and ModelOpt-quantized diffusion support is an open RFC (#2709, #1959).
❌ ComfyUI: no known node support for this ModelOpt layout (the NF4 build linked below has community nodes; this one does not).
Validated only on Blackwell. FP8 on Hopper/Ada is plausible but unverified here.

How to use

Requires a diffusers build with Cosmos 3 support (currently from source) plus modelopt and accelerate. Pin to the verified versions for guaranteed reproducibility (newer versions may also work, but this code path moves fast):

pip install "git+https://github.com/huggingface/diffusers.git@2c7efb95349296cf6bcce981ea036275a82a94df"
pip install accelerate "nvidia-modelopt==0.44.0"
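
Since this code path is version-sensitive, it can help to confirm the installed packages before loading. A small sketch using only the standard library; the pins come from this card (accelerate 1.13.0 is taken from the "Exact versions used" line, although the install command above does not pin it), and the helper name is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# diffusers is installed from a git commit, so its version string is not checked here.
PINNED = {"nvidia-modelopt": "0.44.0", "accelerate": "1.13.0"}

def find_mismatches(pinned, get_version=version):
    """Return {package: (wanted, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

A mismatch is not necessarily fatal, since newer versions may also work, but it is the first thing to rule out if loading fails.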

from load_cosmos3_modelopt import load_pipe   # bundled in this repo
from diffusers import UniPCMultistepScheduler

pipe = load_pipe("prometheusAIR/Cosmos3-Super-fp8")   # or a local path
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=3.0   # NVIDIA's text-to-image setting; use 5.0 for image-to-video
)

# Single image -- pass parameters EXPLICITLY (see warning below):
r = pipe("a weathered lighthouse on a cliff at golden hour, photoreal, 50mm",
         height=1024, width=1024, num_frames=1,
         num_inference_steps=50, guidance_scale=4.0)
r.video[0].save("out.png")   # .video is the list of PIL frames; [0] is the image

# Video (~2 s): frame counts of the form 4n+1 map cleanly to the VAE's 4x
# temporal compression; 24 fps is the native rate and conditions the model.
r = pipe("The lighthouse beam sweeps slowly across the water. Static camera.",
         height=704, width=1280, num_frames=49, fps=24.0,
         num_inference_steps=35, guidance_scale=6.0)

These still-image settings (1024², 50 steps, guidance 4.0, flow_shift=3.0, image read from .video[0]) match NVIDIA's first-party Cosmos3 text-to-image reference.

⚠️ A bare pipe(prompt) call renders a 189-frame 720×1280 video (~8 s at 24 fps) — that is the pipeline's built-in default, not a still. It takes ~40× the compute of a single frame and is the most common reason this model "seems slow." Always pass num_frames/height/width explicitly.
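
The frame-count rule and the cost of the default call can be checked with simple arithmetic. This sketch assumes the common causal-VAE convention that the first frame is encoded on its own and every following group of 4 frames becomes one latent frame, and it estimates cost as linear in latent tokens. Both are assumptions for illustration, and the helper names are hypothetical.

```python
def is_clean_frame_count(num_frames):
    """True for counts of the form 4n+1 (1, 5, 9, ..., 49, ..., 189)."""
    return num_frames >= 1 and (num_frames - 1) % 4 == 0

def latent_frames(num_frames, temporal_factor=4):
    """Latent frame count under the ASSUMED first-frame-alone convention."""
    return (num_frames - 1) // temporal_factor + 1

def relative_cost(height, width, num_frames, ref=(1024, 1024, 1)):
    """Latent-token count relative to a reference render (linear-cost estimate).

    Spatial compression cancels in the ratio, so pixel dimensions are used directly.
    """
    ref_h, ref_w, ref_frames = ref
    tokens = height * width * latent_frames(num_frames)
    ref_tokens = ref_h * ref_w * latent_frames(ref_frames)
    return tokens / ref_tokens

print(latent_frames(49), 49 / 24)      # latent frames and seconds for the video example
print(latent_frames(189), 189 / 24)    # same for the bare pipe(prompt) default
print(relative_cost(720, 1280, 189))   # default call vs. a 1024x1024 still
```

Under these assumptions the bare default works out to roughly 42 times the latent tokens of a single 1024² frame, which is consistent with the ~40× figure. Attention cost grows faster than linearly in token count, so treat this as a rough estimate.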

Cosmos 3 expects a dense structured-JSON prompt for best quality; plain prompts work but render softer. See NVIDIA's prompt-upsampling docs.

Reproducing from scratch: quantize_cosmos3_super_streaming.py (included) streams the bf16 shards directly into compressed FP8/NVFP4 form; peak memory ≈ the compressed footprint, so a single 96 GB card suffices. repackage_for_hf.py then emits this repo's round-trippable layout via save_pretrained + enable_huggingface_checkpointing(). Note that ModelOpt's export_hf_checkpoint() produces a deployment checkpoint that diffusers cannot round-trip; the modelopt_state.pth written by save_pretrained is what makes drop-in loading possible. serve_cosmos3_diffusers.py is a small FastAPI server (text→image, image→video) around the same model.
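
Why streaming keeps peak memory near the compressed footprint can be seen with a toy accounting model. This is not the included script's logic. It assumes shards are loaded one at a time and compressed immediately, and it ignores scale tensors, unquantized layers, and allocator overhead. The shard sizes are hypothetical.

```python
def peak_resident_gb(shard_sizes_gb, compression=0.5):
    """Peak memory when shards are loaded one at a time and compressed immediately.

    At each step the resident set is everything compressed so far
    plus the one uncompressed shard currently being processed.
    """
    compressed_so_far = 0.0
    peak = 0.0
    for shard_gb in shard_sizes_gb:
        peak = max(peak, compressed_so_far + shard_gb)
        compressed_so_far += shard_gb * compression
    return max(peak, compressed_so_far)

shards = [4.0] * 32  # HYPOTHETICAL: ~128 GB of bf16 split into 4 GB shards

print(peak_resident_gb(shards, compression=0.5))   # streaming, bf16 -> 8-bit
print(peak_resident_gb([128.0], compression=0.5))  # whole model loaded at once
```

In this toy setup the streaming peak is the compressed total plus roughly one shard (about 66 GB), while loading the whole model first peaks at the full 128 GB. Smaller shards bring the peak closer to the compressed footprint.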

Known limitations / caveats

The bundled loader is required (see Status). FP8 additionally depends on its payload-dtype restoration: diffusers' loader casts floating params to torch_dtype when no hf_quantizer is present (flagged by a TODO in diffusers' own source), which would otherwise corrupt float8 payloads.
QKV scale unification was skipped at export (ModelOpt's fusion probe doesn't recognize this architecture); q/k/v keep independent scales. Harmless here; relevant only to engines that fuse QKV.
Render sharpness depends heavily on prompt density, scheduler settings, and guidance — tune these; they are not quantization loss.
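
To illustrate what independent q/k/v scales mean: with per-tensor FP8 scaling, each projection gets its own scale derived from its largest absolute weight, whereas an engine that fuses QKV into one GEMM needs a single shared scale. A minimal sketch with toy weights follows. Taking the maximum of the three scales is an assumption about how unification is commonly done, not a description of ModelOpt's export code.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(weights):
    """Scale that maps the largest |weight| onto the top of the E4M3 range."""
    return max(abs(w) for w in weights) / E4M3_MAX

def unified_scale(*tensors):
    """One shared scale for several projections, driven by the largest of them."""
    return max(per_tensor_scale(t) for t in tensors)

# Toy weights, chosen so the scales are exact binary fractions.
q = [0.25, -1.75]
k = [3.5, 0.5]
v = [-0.875, 0.125]

independent = [per_tensor_scale(t) for t in (q, k, v)]
shared = unified_scale(q, k, v)

print(independent)
print(shared)
```

With independent scales each projection uses its full FP8 range. A shared scale would leave the smaller projections with coarser effective resolution, which is why skipping unification is harmless when q, k, and v are never fused.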

Guardrails

Cosmos 3 ships an optional safety checker (cosmos_guardrail). The bundled loader passes enable_safety_checker=False for local single-user use. If you deploy this or publish generated media, install cosmos-guardrail, accept the gated nvidia/Cosmos-Guardrail1 model (released under its own NVIDIA Open Model License, separate from this repo's OpenMDW-1.1), and run with load_pipe(..., enable_safety_checker=True).

Provenance & License

Derivative of: nvidia/Cosmos3-Super (bf16). This repo modifies only the weight encoding of the transformer.
Produced with: NVIDIA TensorRT Model Optimizer + diffusers (from source).
Exact versions used: diffusers 0.39.0.dev0 @ 2c7efb9, nvidia-modelopt 0.44.0, accelerate 1.13.0, torch 2.12.0, CUDA 13.3.
License: OpenMDW-1.1, inherited from the base model. This repo includes a copy of the agreement (LICENSE) and documents its origin above; the upstream repo ships no separate NOTICE file. OpenMDW-1.1 permits modification and redistribution and places no restrictions on generated outputs; you remain responsible for clearing any third-party rights embodied in the materials.

Related repos

Sibling NVFP4 build (smaller footprint, ~36 GB): prometheusAIR/Cosmos3-Super-nvfp4
Original (bf16, source): nvidia/Cosmos3-Super
NF4 (bitsandbytes; broad GPU compatibility incl. Ampere/Ada; drop-in + ComfyUI nodes): SanDiegoDude/Cosmos3-Super-nf4 — a good choice if you are not on Blackwell-class hardware or want turnkey ComfyUI support.
