Microsoft released LLM2CLIP: a CLIP model with a longer context window for complex text inputs 🤯 All models are available with an Apache 2.0 license here: microsoft/llm2clip-672323a266173cfa40b32d4c
TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, yielding better top-k accuracy on retrieval. This will enable better image-text retrieval, better zero-shot image classification, and better vision language models 🔥 Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2411.04997)
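To illustrate what "top-k accuracy on retrieval" means here, below is a minimal NumPy sketch of image-text retrieval with embedding cosine similarity. The function name and the toy embeddings are made up for illustration; this is not the LLM2CLIP API, which in practice produces the text and image embeddings that would be fed into a ranking step like this.

```python
import numpy as np

def topk_retrieval(text_emb, image_embs, k=3):
    """Return indices of the k images most similar to the text query."""
    # Normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ t
    # Sort by descending similarity and keep the top k
    return np.argsort(-sims)[:k]

# Toy 4-dim embeddings: image 1 points almost the same way as the query
query = np.array([1.0, 0.0, 0.0, 0.0])
images = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
])
print(topk_retrieval(query, images, k=2))  # most similar image first
```

Top-k accuracy then just measures how often the correct image lands in those k results; a stronger text encoder (like an LLM) produces embeddings where that happens more often.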