--- library_name: transformers license: cc-by-nc-4.0 tags: - xlm-roberta - eva02 - clip - feature-extraction - sentence-similarity - retrieval - multimodal - multi-modal - crossmodal - cross-modal - mteb - clip-benchmark - vidore - transformers - sentence-transformers - onnx - safetensors - transformers.js language: - multilingual - af - am - ar - as - az - be - bg - bn - br - bs - ca - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - hu - hy - id - is - it - ja - jv - ka - kk - km - kn - ko - ku - ky - la - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - no - om - or - pa - pl - ps - pt - ro - ru - sa - sd - si - sk - sl - so - sq - sr - su - sv - sw - ta - te - th - tl - tr - ug - uk - ur - uz - vi - xh - yi - zh inference: false ---

Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications.

The embedding set trained by Jina AI.

Jina CLIP v2: Multilingual Multimodal Embeddings for Texts and Images

## Quick Start [Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings) ## Intended Usage & Model Info `jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**. It is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as: * *support for multiple languages* - the text tower now supports 100 languages with tuning focus on *Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,* and *Vietnamese.* * *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and consequently computation and storage costs. * *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`) the image tower can now capture finer visual details. This feature along with a more diverse training set enable the model to perform much better on visual document retrieval tasks. Due to this `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures. Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model. ## Data & Parameters [Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon! ## Usage 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/). 2. Alternatively, you can use the model directly via the transformers/sentence-transformers package. ```python # !pip install transformers einops timm pillow from transformers import AutoModel # Initialize the model model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True) # Sentences sentences = [ "A neural network walks into a bar and forgets why it came.", "Why do programmers prefer dark mode? Because light attracts bugs.", ] # Public image URLs image_urls = [ "https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg", "https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg", ] # Choose a matryoshka dimension, set to None to get the full 1024-dim vectors truncate_dim = 512 # Encode text and images text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim) image_embeddings = model.encode_image( image_urls, truncate_dim=truncate_dim ) # also accepts PIL.image, local filenames, dataURI # Encode query text query = "tell me a joke about AI" text_query_embeddings = model.encode_text( query, task="retrieval.query", truncate_dim=truncate_dim ) # Compute similarities print(text_query_embeddings @ text_embeddings[1].T) # text embedding similarity print(text_query_embeddings @ image_embeddings[0].T) # text-image cross-modal similarity print(image_embeddings[0] @ image_embeddings[1].T) # image-image cross-modal similarity print(image_embeddings[0] @ text_embeddings[0].T) # image-text cross-modal similarity ``` or via sentence-transformers: ```python # !pip install sentence-transformers einops timm pillow from sentence_transformers import SentenceTransformer # Initialize the model truncate_dim = 512 model = SentenceTransformer( "jinaai/jina-clip-v2", trust_remote_code=True, truncate_dim=truncate_dim ) # Sentences sentences = [ "A neural network walks into a bar and forgets why it came.", "Why do programmers prefer dark mode? Because light attracts bugs.", ] # Public image URLs image_urls = [ "https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg", "https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg", ] text_embeddings = model.encode(sentences) image_embeddings = model.encode(image_urls) query = "tell me a joke about AI" text_query_embeddings = model.encode(query, prompt_name="retrieval.query") ``` JavaScript developers can use Jina CLIP via the [transformers.js](https://huggingface.co/docs/transformers.js) library. Note that to use this model, you need to install transformers.js [v3](https://github.com/xenova/transformers.js/tree/v3) from source using `npm install xenova/transformers.js#v3`. ```js import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers'; // Load tokenizer and text model const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v2'); const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v2'); // Load processor and vision model const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32'); const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v2'); // Run tokenization const texts = [ 'A neural network walks into a bar and forgets why it came.', 'Why do programmers prefer dark mode? Because light attracts bugs.', ]; const text_inputs = tokenizer(texts, { padding: true, truncation: true }); // Compute text embeddings const { text_embeds } = await text_model(text_inputs); // Read images and run processor const urls = [ 'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg', 'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg' ]; const image = await Promise.all(urls.map(url => RawImage.read(url))); const image_inputs = await processor(image); // Compute vision embeddings const { image_embeds } = await vision_model(image_inputs); // Compute similarities console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)) // text embedding similarity console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)) // text-image cross-modal similarity console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)) // text-image cross-modal similarity console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity ``` ## Contact Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas. ## Citation If you find `jina-clip-v2` useful in your research, please cite the following paper: ```bibtex @misc{2405.20204, Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao}, Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever}, Year = {2024}, Eprint = {arXiv:2405.20204}, } ```