Confusion About Cosmos3 Super I2V vs. T2V vs. “Base”, Given That the Model Card Says Omnimodal

#9
by komixenon - opened

Since the model card states that Cosmos3 is omnimodal, is it generally accepted that all Cosmos3 variants support all input types, but that the I2V model is better for image input, the T2V model is better for text input, and the "base" model is better for text generation?

Or, if you want to do a T2V task, do you specifically have to use the T2V model? Likewise, for text generation, do you have to use the “base” Cosmos3 Super model, and so on?
Which of these understandings is correct?

My understanding is that the base model does everything. If you specifically want t2i or i2v, you COULD use the main Cosmos3 Super model for this, but you might get better results with their SFT'ed versions for that specific task.
