Multi Modal Capabilities?

#7
by Vvkishere - opened

I think this also has potential if we can train models like Clip to change domains for image based data. The usage could be more fine tuned analysis of object detection and image retrieval.

NLP Group of The University of Hong Kong org

Thanks for your comment!

I agree that the INSTRUCTOR model can be potentially applied in other domains for computer vision. The key part would probably be the design of the instruction and the merge to the input. For the training/fine-tuning part, I think it would be similar to that of the INSTRUCTOR model with contrastive loss.

Sign up or log in to comment