About extracting embedding vectors of images and texts.

#10
by iceleaf97tech - opened

Is it possible to extract embedding vectors of images and texts using these models?
If so, how should I do that?
Could you provide a code template? Thanks.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Multimodal VQA (visual question answering) models are not recommended for embedding extraction:

  • VQA models are primarily designed for visual question-answering tasks, with architectures and optimization goals that differ from embedding extraction.
  • The CLIP model is trained specifically to align image and text embeddings in a shared space, so it generally performs better on embedding tasks and adapts more readily to them.

Using CLIP for embedding extraction is more efficient and better suited to the practical requirements of embedding tasks.
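
For the code template asked about above, here is a minimal sketch of CLIP embedding extraction. It is not an official recipe from this repository. It assumes the Hugging Face `transformers` library together with `torch` and `Pillow`, and it assumes the public `openai/clip-vit-base-patch32` checkpoint; any other CLIP checkpoint should work the same way. The function names `extract_clip_embeddings` and `cosine_similarity` are illustrative only.

```python
import math


def extract_clip_embeddings(image_paths, texts,
                            model_name="openai/clip-vit-base-patch32"):
    """Return (image_embeddings, text_embeddings) as lists of float lists.

    Assumption: `model_name` is a CLIP checkpoint loadable by transformers.
    """
    # Imported lazily so cosine_similarity below works without these packages.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # image_embeds / text_embeds are the projected vectors in the shared space.
    # Normalize explicitly so dot products equal cosine similarities.
    image_embeds = torch.nn.functional.normalize(outputs.image_embeds, dim=-1)
    text_embeds = torch.nn.functional.normalize(outputs.text_embeds, dim=-1)
    return image_embeds.tolist(), text_embeds.tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

As a usage example, with a hypothetical local file `cat.jpg`, calling `extract_clip_embeddings(["cat.jpg"], ["a photo of a cat", "a photo of a dog"])` returns one image vector and two text vectors. `cosine_similarity(image_vectors[0], text_vectors[0])` then scores how well the first caption matches the image.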

zRzRzRzRzRzRzR changed discussion status to closed
