Confusion About Cosmos3 Super I2V vs. T2V vs. “Base”, Given That the Model Card Says Omnimodal

by komixenon - opened about 21 hours ago

Since the model card states that Cosmos3 is omnimodal, is it generally accepted that all Cosmos3 variants support all input types, but that the I2V model is better for image input, the T2V model is better for text input, and the "base" model is better for text generation?

Or, if you want to do a T2V task, do you specifically have to use the T2V model? Likewise, for text generation, do you have to use the “base” Cosmos3 Super model, and so on?
Is my understanding correct?

Sentdex

about 15 hours ago

My understanding is that the base model does everything. If you specifically want t2i or i2v, you COULD use the main Cosmos3 Super model for it, but you might get better results with their SFTed versions for that specific task.
