Can you use this model with image and text-only inputs apart from video?

#4
by lunahr - opened

The video capability is cool, but can you also perform inference with images or just text?

Also would freezing the vision encoder work to train the LLM part of the model but keep its capabilities to use the vision encoder?

Nemo Station org

No, image inference doesn't work; we trained Marlin-2B for video specifically.
Freezing the ViT alone wouldn't preserve image capability anyway. The dominant drift isn't in the encoder itself, it's in the merger between encoder and LLM. Even with a frozen ViT, the merger re-aligns to the changing LLM during fine-tuning, which is where most of the capability erosion comes from. Our own v0 SFT went the other way: a frozen ViT was *under-*trained, and video quality improved once we unfroze it with vit_lr=1e-4. If you want to preserve an upstream capability, you mix image data into the SFT; freezing alone isn't enough.
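
To make the freeze/unfreeze distinction concrete, here is a rough sketch of how parameters could be split into per-module learning-rate groups. It is not our training code: the name prefixes (`visual.`, `visual.merger.`), the helper `build_param_groups`, and the `merger_lr` / `llm_lr` defaults are all assumptions for illustration. Only `vit_lr=1e-4` comes from the run described above. The output is shaped like the param-group dicts a typical optimizer accepts, but it works on names only so it runs without any framework.

```python
# Hypothetical name prefixes; real checkpoints may name these modules differently.
MERGER_PREFIX = "visual.merger."
VIT_PREFIX = "visual."


def build_param_groups(param_names, vit_lr=1e-4, merger_lr=1e-4,
                       llm_lr=1e-5, freeze_vit=False):
    """Split parameter names into ViT / merger / LLM groups with separate LRs.

    merger_lr and llm_lr are placeholder values, not tuned settings.
    With freeze_vit=True the ViT group is dropped, but the merger still trains,
    which is why freezing the encoder alone does not stop alignment drift.
    """
    buckets = {"vit": [], "merger": [], "llm": []}
    for name in param_names:
        # Check the merger first because its prefix also starts with "visual.".
        if name.startswith(MERGER_PREFIX):
            buckets["merger"].append(name)
        elif name.startswith(VIT_PREFIX):
            buckets["vit"].append(name)
        else:
            buckets["llm"].append(name)

    lrs = {"vit": vit_lr, "merger": merger_lr, "llm": llm_lr}
    groups = []
    for key in ("vit", "merger", "llm"):
        if key == "vit" and freeze_vit:
            continue  # frozen: no optimizer group for the encoder
        if buckets[key]:
            groups.append({"name": key, "params": buckets[key], "lr": lrs[key]})
    return groups


names = [
    "visual.blocks.0.attn.weight",
    "visual.merger.mlp.weight",
    "model.layers.0.mlp.weight",
]
groups = build_param_groups(names)
frozen = build_param_groups(names, freeze_vit=True)
```

In a real setup you would pass the actual parameter tensors instead of names and also set `requires_grad=False` on the frozen ones.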

The video capability is cool, but can you also perform inference with images or just text?

Also would freezing the vision encoder work to train the LLM part of the model but keep its capabilities to use the vision encoder?

You could try converting an image into a video and giving it that. I don't know how it would behave, though.
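
Something like this is what I mean, as an untested idea rather than a supported path: repeat the same image as every frame of a short clip. The helper name `image_to_clip`, the nested-list image format, and the default of 8 frames are all made up for the sketch; the model's real preprocessing may expect a different frame count or format.

```python
def image_to_clip(image, num_frames=8):
    """Turn one image (H x W nested list of pixel values) into a static clip.

    Returns a list of num_frames independent copies of the image,
    i.e. a "video" in which nothing moves.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Copy each row so that editing one frame does not affect the others.
    return [[list(row) for row in image] for _ in range(num_frames)]


img = [[0, 1], [2, 3]]
clip = image_to_clip(img, num_frames=4)
```

With array libraries you would do the same thing by stacking the image along a new leading frame axis.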

Nemo Station org

Will definitely try that and share the results.
