Instructions to use nvidia/Cosmos3-Super with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Cosmos
How to use nvidia/Cosmos3-Super with Cosmos:
# No code snippets available yet for this library.
# To use this model, check the repository files and the library's documentation.
# Want to help? PRs adding snippets are welcome at:
# https://github.com/huggingface/huggingface.js
- Diffusers
How to use nvidia/Cosmos3-Super with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("nvidia/Cosmos3-Super", dtype=torch.bfloat16, device_map="cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
- Notebooks
- Google Colab
- Kaggle
Confusion About Cosmos3 Super I2V vs. T2V vs. the “Base” Model, Given That the Model Card Says Omnimodal
Since the model card states that Cosmos3 is omnimodal, is it generally accepted that all Cosmos3 variants support all input types, with the I2V variant working better for image input, the T2V variant for text input, and the "base" model for text generation?
Or, if you want to do a T2V task, do you specifically have to use the T2V model? Likewise, for text generation, do you have to use the “base” Cosmos3 Super model, and so on?
Is my understanding correct?
My understanding is that the base model does everything. If you specifically want t2i or i2v, you could use the main Cosmos3 Super model, but you might get better results with their fine-tuned (SFT) versions for that specific task.
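If it helps, here is a rough sketch of that rule of thumb in Python: prefer a task-specific checkpoint when one is listed, otherwise fall back to the base model. Only the base repo ID `nvidia/Cosmos3-Super` comes from this page. The helper name `pick_checkpoint`, the `TASK_SPECIFIC` table, and the two placeholder IDs are made up for illustration and would need to be replaced with the actual variant repos.

```python
# Only this base repo ID appears on the page.
BASE_MODEL = "nvidia/Cosmos3-Super"

# Hypothetical table: the values are placeholders, NOT real repo IDs.
TASK_SPECIFIC = {
    "t2v": "<t2v-variant-repo-id>",
    "i2v": "<i2v-variant-repo-id>",
}


def pick_checkpoint(task: str, prefer_finetuned: bool = True) -> str:
    """Return a task-specific checkpoint if one is listed, else the base model."""
    key = task.strip().lower()
    if prefer_finetuned and key in TASK_SPECIFIC:
        return TASK_SPECIFIC[key]
    # Tasks without a dedicated variant (e.g. text generation) use the base model.
    return BASE_MODEL


if __name__ == "__main__":
    for task in ("I2V", "t2v", "text-generation"):
        print(task, "->", pick_checkpoint(task))
```

Passing `prefer_finetuned=False` always returns the base model, which matches the point that the base model can in principle handle every task.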