ali-vilab/text-to-video-ms-1.7b · How does ModelScope know what it knows?

Does anyone understand what exactly this model was trained on? The most obvious answer is Shutterstock, but prompts such as eating and drinking almost feel like they are based on low-resolution 3D models…the mouths deform in unusual ways that I couldn’t imagine a computer learning from watching videos.

I am asking in order to figure out the limitations of fine-tuning this model and what types of footage would lead to an improved model. I have started fine-tuning my own models and hope to post my discoveries, and I would love to hear from others who use and fine-tune the model.