Cosmos3-Super — Weight-Only FP8 (NVIDIA ModelOpt)
Weight-only quantization of the Cosmos3OmniTransformer from NVIDIA's
nvidia/Cosmos3-Super — the 64B
omnimodal Cosmos 3 world model (text-to-image, text-to-video, image-to-video,
optional synchronized sound). Produced with
NVIDIA TensorRT Model Optimizer (ModelOpt)
on a single 96 GB workstation GPU, via a streaming method that never materializes
the ~128 GB bf16 model (method scripts included).
Only the transformer is quantized. The VAEs and tokenizers are the original bf16 components, bundled so the repo is self-contained. Loading requires the bundled load_cosmos3_modelopt.py (see How to use).
Variants & measured performance
Measured on an RTX 6000 Pro Blackwell (96 GB), 1024×1024 single-frame render, 50 steps. Drop-in loading of these repos performs identically to the in-memory quantization path they were validated against.
| Build | Bits (weights) | Repo size | Resident VRAM | s/it (1024² still) |
|---|---|---|---|---|
| FP8 (this repo) | 8-bit (E4M3) | ~64 GB | ~67 GB (meas.) | ~1.2 |
| NVFP4 (sibling) | 4-bit (E2M1 + scales) | ~36 GB | ~43 GB (meas.) | ~4.6 |
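The repo sizes in the table line up with a simple bits-per-weight estimate. The sketch below is a back-of-envelope check, not a measurement: it assumes all 64B parameters are quantized, that NVFP4 stores one 8-bit scale per block of 16 weights, and that 1 GB means 1e9 bytes. The real repos also carry the bf16 layers, VAEs, and tokenizers, so the match is only approximate.

```python
# Back-of-envelope weight footprint for a 64B-parameter model.
# Assumptions (not taken from the repo): every parameter is quantized,
# NVFP4 keeps one 8-bit scale per block of 16 weights, 1 GB = 1e9 bytes.
PARAMS = 64e9


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Return the weight footprint in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(16)           # unquantized baseline
fp8_gb = weight_gb(8)             # one byte per weight; scale overhead ignored
nvfp4_gb = weight_gb(4 + 8 / 16)  # 4-bit payload plus an 8-bit scale per 16 weights

print(f"bf16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB, NVFP4 ~{nvfp4_gb:.0f} GB")
# prints: bf16 ~128 GB, FP8 ~64 GB, NVFP4 ~36 GB
```

The gap between these figures and the measured resident VRAM is expected, since resident memory also includes the unquantized components and working buffers.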
Pick FP8 if it fits — in this serving path it is both higher fidelity and ~4× faster, because FP8 dequant is a single cheap scale on a native float8 tensor, while NVFP4 dequant must unpack two 4-bit values per byte and apply two-level block scales in PyTorch. Pick NVFP4 for footprint (it brings the model into ~48 GB-card territory for stills). Note this is dequant-on-the-fly: quantization here buys memory, not speed — NVFP4's hardware FP4 tensor-core advantage only materializes in engines with FP4 GEMM kernels (TRT-LLM/vLLM territory), not in diffusers.
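To make the difference in dequantization work concrete, here is a minimal pure-Python sketch of the two paths described above. It is illustrative only and is not ModelOpt's implementation: the function names are hypothetical, the low-nibble-first packing order is an assumption, and the FP8 path takes already-decoded values as plain floats.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code into a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_fp8(values, scale):
    """FP8 path: a single multiply per weight by one scale."""
    return [v * scale for v in values]


def dequant_nvfp4(packed, block_scales, global_scale, block_size=16):
    """NVFP4 path: unpack two codes per byte, then apply two levels of scale.

    Assumption: the low nibble holds the first weight of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return [
        v * block_scales[i // block_size] * global_scale
        for i, v in enumerate(values)
    ]


# Tiny example with a block size of 2 so the per-block scales are visible.
print(dequant_nvfp4(bytes([0x21, 0xF9]), [2.0, 0.5], 1.0, block_size=2))
# prints: [1.0, 2.0, -0.25, -3.0]
```

The NVFP4 path does a bit-unpack, a table lookup, and two multiplies per weight, where the FP8 path does one multiply. That extra work is what shows up as the higher s/it when no FP4 GEMM kernel is available.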
Layers kept in bf16 (not quantized): embeddings, norms, the reasoner head, in/out projections, time/modality adapters, audio adapter. The 64 transformer blocks' attention + MLP linears (incl. MoE experts) are quantized.
Status
- ✅ Drop-in loading verified end to end (load → render → performance parity with the in-memory method) on Blackwell (sm_120), via the bundled loader.
- ✅ modelopt_state.pth is part of the checkpoint and is required: it restores the quantized module structure at load. Do not delete it.
- ⚠️ The loader (load_cosmos3_modelopt.py) is required, not optional. The current diffusers/accelerate/modelopt combination cannot materialize a pre-quantized ModelOpt checkpoint unaided; the loader applies three small, source-verified workarounds (parameter materialization for packed weights, payload-dtype restoration for FP8, and weight-only quantizer enforcement) plus the validated bf16 dtype normalization. ModelOpt marks this path experimental; expect the loader to become unnecessary as upstream catches up.
- ❌ vLLM-Omni: not a working path as of 0.22.0. This is an upstream roadmap gap, not a defect of this checkpoint: vLLM-Omni's ModelOpt integration is currently wired for LLMs only, and ModelOpt-quantized diffusion support is an open RFC (#2709, #1959).
- ❌ ComfyUI: no known node support for this ModelOpt layout (the NF4 build linked below has community nodes; this one does not).
- Validated only on Blackwell. FP8 on Hopper/Ada is plausible but unverified here.
How to use
Requires a diffusers build with Cosmos 3 support (currently from source) plus
modelopt and accelerate. Pin to the verified versions for guaranteed
reproducibility (newer versions may also work, but this code path moves fast):
pip install "git+https://github.com/huggingface/diffusers.git@2c7efb95349296cf6bcce981ea036275a82a94df"
pip install accelerate "nvidia-modelopt==0.44.0"
from load_cosmos3_modelopt import load_pipe  # bundled in this repo
from diffusers import UniPCMultistepScheduler

pipe = load_pipe("prometheusAIR/Cosmos3-Super-FP8")  # or a local path
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=3.0  # NVIDIA's text-to-image setting; use 5.0 for image-to-video
)

# Single image -- pass parameters EXPLICITLY (see warning below):
r = pipe("a weathered lighthouse on a cliff at golden hour, photoreal, 50mm",
         height=1024, width=1024, num_frames=1,
         num_inference_steps=50, guidance_scale=4.0)
r.video[0].save("out.png")  # .video is the list of PIL frames; [0] is the image

# Video (~2 s): frame counts of the form 4n+1 map cleanly to the VAE's 4x
# temporal compression; 24 fps is the native rate and conditions the model.
r = pipe("The lighthouse beam sweeps slowly across the water. Static camera.",
         height=704, width=1280, num_frames=49, fps=24.0,
         num_inference_steps=35, guidance_scale=6.0)
These still-image settings (1024², 50 steps, guidance 4.0, flow_shift=3.0,
result.video[0]) match NVIDIA's first-party Cosmos3 text-to-image reference.
⚠️ A bare pipe(prompt) call renders a 189-frame 720×1280 video (~8 s at 24 fps): that is the pipeline's built-in default, not a still. It takes ~40× the compute of a single frame and is the most common reason this model "seems slow." Always pass num_frames/height/width explicitly.
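The frame-count rule and the rough cost figure above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions, not part of the bundled scripts: the helper names are hypothetical, the latent-frame formula assumes a causal VAE that keeps the first frame and packs each following group of 4 frames into one latent frame, and cost is assumed to scale linearly with latent token count (attention cost grows faster than that, so treat the ratio as a floor).

```python
def snap_num_frames(seconds: float, fps: float = 24.0) -> int:
    """Round a target duration to the nearest frame count of the form 4n+1."""
    target = seconds * fps
    n = max(0, round((target - 1) / 4))
    return 4 * n + 1


def latent_frames(num_frames: int) -> int:
    """Latent frame count under the assumed causal 4x temporal compression."""
    return (num_frames - 1) // 4 + 1


def relative_cost(height: int, width: int, num_frames: int,
                  ref_height: int = 1024, ref_width: int = 1024,
                  ref_frames: int = 1) -> float:
    """Token-count ratio against a reference render (spatial factor cancels)."""
    tokens = height * width * latent_frames(num_frames)
    ref_tokens = ref_height * ref_width * latent_frames(ref_frames)
    return tokens / ref_tokens


print(snap_num_frames(2.0))            # prints: 49
print(latent_frames(189))              # prints: 48
print(relative_cost(720, 1280, 189))   # prints: 42.1875
```

Under these assumptions the default 189-frame 720×1280 render carries about 42 times the latent tokens of a single 1024×1024 frame, which is consistent with the ~40× figure.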
Cosmos 3 expects a dense structured-JSON prompt for best quality; plain prompts work but render softer. See NVIDIA's prompt-upsampling docs.
Reproducing from scratch: quantize_cosmos3_super_streaming.py (included)
streams the bf16 shards directly into compressed FP8/NVFP4 form (peak memory ≈
the compressed footprint, so a single 96 GB card suffices), and
repackage_for_hf.py emits this repo's round-trippable layout via
save_pretrained + enable_huggingface_checkpointing() — note that ModelOpt's
export_hf_checkpoint() produces a deployment checkpoint that diffusers
cannot round-trip; the modelopt_state.pth from save_pretrained is what makes
drop-in loading possible. serve_cosmos3_diffusers.py is a small FastAPI server
(text→image, image→video) around the same model.
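The reason streaming keeps peak memory near the compressed footprint can be shown with a small accounting model. This sketch does not reproduce quantize_cosmos3_super_streaming.py; the function name, the shard sizes, and the load/compress/free ordering are assumptions chosen for illustration.

```python
def streaming_peak_gb(shard_sizes_gb, compression_ratio):
    """Peak resident size when bf16 shards are quantized one at a time.

    Assumed ordering: load one bf16 shard, produce its compressed copy,
    free the bf16 shard, then move on. The peak therefore is the compressed
    weights accumulated so far plus one bf16 shard and its compressed copy.
    """
    compressed = 0.0
    peak = 0.0
    for shard in shard_sizes_gb:
        shard_out = shard * compression_ratio
        peak = max(peak, compressed + shard + shard_out)
        compressed += shard_out  # the bf16 shard is freed here
    return peak


# Hypothetical layout: 32 bf16 shards of 4 GB each (128 GB total), bf16 -> FP8.
shards = [4.0] * 32
print(streaming_peak_gb(shards, 0.5))  # prints: 68.0
print(sum(shards))                     # prints: 128.0
```

With these made-up shard sizes the peak is 68 GB, against 128 GB for loading the whole bf16 model first, which is why a single 96 GB card is enough.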
Known limitations / caveats
- The bundled loader is required (see Status). FP8 additionally depends on its payload-dtype restoration: diffusers' loader casts floating params to torch_dtype when no hf_quantizer is present (flagged by a TODO in diffusers' own source), which would otherwise corrupt float8 payloads.
- QKV scale unification was skipped at export (ModelOpt's fusion probe doesn't recognize this architecture); q/k/v keep independent scales. Harmless here; relevant only to engines that fuse QKV.
- Render sharpness depends heavily on prompt density, scheduler settings, and guidance — tune these; they are not quantization loss.
Guardrails
Cosmos 3 ships an optional safety checker (cosmos_guardrail). The bundled
loader passes enable_safety_checker=False for local single-user use. If you
deploy this or publish generated media, install cosmos-guardrail, accept the
gated nvidia/Cosmos-Guardrail1
model (released under its own NVIDIA Open Model License, separate from this
repo's OpenMDW-1.1), and run with load_pipe(..., enable_safety_checker=True).
Provenance & License
- Derivative of: nvidia/Cosmos3-Super (bf16). This repo modifies only the weight encoding of the transformer.
- Produced with: NVIDIA TensorRT Model Optimizer + diffusers (from source).
- Exact versions used: diffusers 0.39.0.dev0@2c7efb9, nvidia-modelopt 0.44.0, accelerate 1.13.0, torch 2.12.0, CUDA 13.3.
- License: OpenMDW-1.1, inherited from the base model. This repo includes a copy of the agreement (LICENSE) and documents its origin above; the upstream repo ships no separate NOTICE file. OpenMDW-1.1 permits modification and redistribution and places no restrictions on generated outputs; you remain responsible for clearing any third-party rights embodied in the materials.
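Because this code path is version-sensitive, it can help to compare the local environment against the versions listed above before loading. The following is a minimal sketch using only the standard library; the helper name find_mismatches is hypothetical, and the check covers the reported version strings only, so it cannot confirm the exact diffusers commit or the CUDA version.

```python
from importlib import metadata

# Versions listed in this card. diffusers is a source build, so only the
# version string (not the commit hash) can be checked this way.
PINNED = {
    "diffusers": "0.39.0.dev0",
    "nvidia-modelopt": "0.44.0",
    "accelerate": "1.13.0",
    "torch": "2.12.0",
}


def find_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels may carry a local suffix after "+"; compare the public part only.
        public = installed.split("+")[0] if installed else None
        if public != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means the four version strings match; any mismatch is a hint to re-pin before debugging loader errors.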
Related repos
- Sibling NVFP4 build (smaller footprint, ~36 GB): prometheusAIR/Cosmos3-Super-nvfp4
- Original (bf16, source): nvidia/Cosmos3-Super
- NF4 (bitsandbytes; broad GPU compatibility incl. Ampere/Ada; drop-in + ComfyUI nodes): SanDiegoDude/Cosmos3-Super-nf4, a good choice if you are not on Blackwell-class hardware or want turnkey ComfyUI support.