Exciting News in AI: Jina AI Releases jina-clip-v2!
The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:
📌 Technical Highlights:
- Dual-encoder architecture combining a 561M-parameter Jina XLM-RoBERTa text encoder and a 304M-parameter EVA02-L14 vision encoder (usage sketch after this list)
- Supports 89 languages with an 8,192-token context length
- Processes images up to 512×512 pixels with a 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage
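For anyone who wants to poke at the dual encoder directly, here is a minimal usage sketch. It assumes jina-clip-v2 exposes the same `encode_text` / `encode_image` convenience methods that jina-clip-v1 provides via `trust_remote_code`; the image URL is a placeholder.

```python
from transformers import AutoModel

# Assumption: jina-clip-v2 keeps jina-clip-v1's encode_text / encode_image
# helpers when loaded with trust_remote_code=True.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["a photo of a mountain lake", "ein Foto von einem Bergsee"]  # EN + DE
images = ["https://example.com/lake.jpg"]  # hypothetical URL

text_emb = model.encode_text(texts)     # text tower: Jina XLM-RoBERTa
image_emb = model.encode_image(images)  # vision tower: EVA02-L14

# Both towers project into one shared space, so a dot product of the
# normalized vectors gives text-image similarity.
```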
⚡️ Under the Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using InfoNCE loss in both directions (see the sketch after this list)
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
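The bidirectional InfoNCE objective is easy to state in code. The following is an illustrative sketch of the standard symmetric CLIP-style loss, not Jina AI's exact training code; the temperature value is a common default, not a published hyperparameter.

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(text_emb: torch.Tensor,
                           image_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (text, image) embeddings.

    Row i of each tensor is assumed to be a matching text-image pair;
    all other rows in the batch act as in-batch negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature        # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)          # text -> image
    loss_i2t = F.cross_entropy(logits.T, targets)        # image -> text
    return (loss_t2i + loss_i2t) / 2
```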
📈 Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on the CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs. 1024D; see the truncation sketch after this list)
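The Matryoshka trick behind that last bullet is simply truncate-and-renormalize: training packs the most informative features into the leading dimensions, so a prefix of the vector remains a usable embedding. A minimal sketch:

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length.

    With Matryoshka-trained embeddings, the 256D prefix retains most of
    the retrieval quality of the full 1024D vector (per the numbers above),
    cutting vector storage by 75%.
    """
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)
```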
🎯 Key Innovation: The model solves the long-standing challenge of unifying text-only and multimodal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems, as sketched below!
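As a concrete example of such a system, here is a hedged end-to-end sketch of cross-lingual visual search: index the gallery images once, then query with text in any supported language. It reuses the assumed `encode_text` / `encode_image` interface from the first snippet, and the gallery URLs are placeholders.

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

gallery = ["https://example.com/lake.jpg",   # hypothetical URLs
           "https://example.com/city.jpg"]
index = np.asarray(model.encode_image(gallery))          # built once, (N, 1024)
index /= np.linalg.norm(index, axis=-1, keepdims=True)

# Polish query ("photo of a mountain lake") against an uncaptioned gallery.
query = np.asarray(model.encode_text(["zdjęcie górskiego jeziora"]))
query /= np.linalg.norm(query, axis=-1, keepdims=True)

scores = (query @ index.T)[0]                 # cosine similarity per image
print(gallery[int(np.argmax(scores))])        # best match across languages
```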
Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
I recently wrote a survey of deep reinforcement learning. The paper is a compact guide to some of the key concepts in reinforcement learning. Find the paper below:
OmniVision-968M: a new local VLM for edge devices, fast & small but performant 💨
a new vision language model with 9× fewer image tokens, super efficient
aligned with DPO to reduce hallucinations ⚡️
Apache 2.0 license 🔥
Models
💻 Coding: The Qwen team released two Qwen2.5-Coder checkpoints, at 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction-SFT'd versions and their datasets!
🖼️ Image/Video Gen: Alibaba's vision lab released In-Context LoRA: 10 LoRA models on different themes based on Flux. Also, Mochi, the SOTA video generation model with an Apache 2.0 license, now comes natively supported in diffusers.
🖼️ VLMs/Multimodal: NexaAIDev released OmniVision-968M, a new vision language model aligned with DPO to reduce hallucinations; it also comes with GGUF checkpoints. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search.
🎮 AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay 🤯
Datasets
Common Corpus: a text dataset with 2T tokens under permissive licenses for EN/FR, spanning sources such as code, science, finance, and culture.