---
license: apache-2.0
language:
  - en
  - ar
  - hy
  - zh
  - fr
  - de
  - he
  - hi
  - id
  - it
  - ja
  - ko
  - fa
  - pl
  - pt
  - ru
  - es
  - th
  - tr
  - uk
  - vi
pipeline_tag: feature-extraction
tags:
  - clip
  - vision
datasets:
  - sbu_captions
  - visual_genome
  - ChristophSchuhmann/MS_COCO_2017_URL_TEXT
---

# UForm

*Multi-Modal Inference Library for Semantic Search Applications*

UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

This is the model card of the Multilingual model (21 languages) with:

* a 12-layer BERT (8 layers for unimodal encoding and the remaining 4 for multimodal encoding)
* a ViT-B/16 (image resolution is 224x224)

The model was trained on a balanced multilingual dataset.

If you need an English-only model, check this.

## Evaluation

For all evaluations, the multimodal part was used unless otherwise stated.

### Monolingual

| Dataset                                    | Recall@1 | Recall@5 | Recall@10 |
| :----------------------------------------- | -------: | -------: | --------: |
| Zero-Shot Flickr                            | 0.558    | 0.813    | 0.874     |
| MS-COCO (train split was in training data) | 0.401    | 0.680    | 0.781     |

### Multilingual

#### XTD-10

The metric is recall@10.

| English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
| ------: | -----: | ------: | -----: | ------: | ------: | -------: | -----: | ------: | ------: | -----: |
| 96.1    | 93.5   | 95.7    | 94.1   | 94.4    | 90.4    | 90.2     | 91.3   | 95.2    | 93.8    | 95.8   |
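
For context, recall@K here is the standard cross-modal retrieval metric: a query counts as a hit if its ground-truth match appears among the K most similar candidates. The sketch below is a generic illustration of that computation, not the evaluation script used for these numbers, and it assumes the ground-truth candidate for query `i` sits at column `i` of the similarity matrix.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Generic recall@K for a (num_queries, num_candidates) similarity matrix.

    Assumes the correct candidate for query i is stored at column i, which may
    differ from the exact pairing scheme used in XTD-10 or COCO-SM.
    """
    top_k = similarity.topk(k, dim=1).indices                # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)  # (num_queries, 1)
    hits = (top_k == targets).any(dim=1)                     # target found in top-K?
    return hits.float().mean().item()

# Toy example: 100 queries scored against 100 candidates.
scores = torch.randn(100, 100)
print(recall_at_k(scores, k=10))
```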

#### COCO-SM

For this evaluation, only the unimodal part was used.

##### Recall

| Target Language      | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :------------------- | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| Arabic               | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| Armenian             | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
| Chinese              | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English              | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French               | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
| German               | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
| Hebrew               | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
| Hindi                | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Indonesian           | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
| Italian              | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
| Japanese             | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
| Korean               | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
| Persian              | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
| Polish               | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
| Portuguese           | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
| Russian              | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
| Spanish              | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Thai                 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
| Turkish              | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
| Ukrainian            | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
| Vietnamese           | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
| Mean                 | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate     | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB            | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |

##### NDCG@20

| Language                    | OpenCLIP NDCG | UForm NDCG |
| :-------------------------- | ------------: | ---------: |
| Arabic                      | 0.639 | 0.868 |
| Armenian                    | 0.204 | 0.691 |
| Chinese                     | 0.731 | 0.880 |
| French                      | 0.823 | 0.932 |
| German                      | 0.806 | 0.927 |
| Hebrew                      | 0.657 | 0.791 |
| Hindi                       | 0.616 | 0.879 |
| Indonesian                  | 0.733 | 0.870 |
| Italian                     | 0.811 | 0.930 |
| Japanese                    | 0.737 | 0.885 |
| Korean                      | 0.686 | 0.869 |
| Persian                     | 0.667 | 0.831 |
| Polish                      | 0.764 | 0.897 |
| Portuguese                  | 0.832 | 0.897 |
| Russian                     | 0.777 | 0.906 |
| Spanish                     | 0.849 | 0.939 |
| Thai                        | 0.606 | 0.822 |
| Turkish                     | 0.701 | 0.898 |
| Ukrainian                   | 0.704 | 0.851 |
| Vietnamese                  | 0.697 | 0.818 |
| Mean (all)                  | 0.716 ± 0.149 | 0.875 ± 0.064 |
| Mean (Google Translate)     | 0.732 ± 0.145 | 0.869 ± 0.063 |
| Mean (Microsoft Translator) | 0.730 ± 0.149 | 0.869 ± 0.066 |
| Mean (NLLB)                 | 0.686 ± 0.158 | 0.888 ± 0.064 |
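
NDCG@20 additionally rewards rankings that place the relevant captions closer to the top rather than merely inside the top 20. As a reference only, the snippet below shows a textbook NDCG@K computation for a single query's relevance labels; the exact relevance definition and gain scheme used in COCO-SM may differ.

```python
import math

def ndcg_at_k(relevance: list, k: int) -> float:
    """Textbook NDCG@K for one query.

    `relevance` holds the gains of the retrieved items in ranked order,
    e.g. 1.0 for a correct caption and 0.0 otherwise.
    """
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# A single correct caption retrieved at rank 3 out of 20 candidates.
print(ndcg_at_k([0, 0, 1] + [0] * 17, k=20))  # 0.5
```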

## Installation

```bash
pip install uform[torch]
```

## Usage

To load the model:

```python
import uform

model, processor = uform.get_model('unum-cloud/uform-vl-multilingual-v2')
```

To encode data:

```python
from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```

To get features:

```python
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
```

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

```python
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
```

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

### Cosine Similarity

```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The similarity will lie in the [-1, 1] range, with 1 meaning a perfect match.

Pros:

* Computationally cheap.
* Only unimodal embeddings are required, and unimodal encoding is faster than joint encoding.
* Suitable for retrieval in large collections (see the sketch below).

Cons:

* Takes into account only coarse-grained features.
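
Because only unimodal embeddings are involved, the same idea scales to ranking a whole collection: normalize the embeddings once, and a single matrix product scores every candidate. A minimal sketch, reusing the `model` and `processor` calls shown above; the captions are illustrative, and each `encode_text` call is assumed to return a single-row embedding as in the earlier examples:

```python
import torch
import torch.nn.functional as F

captions = [
    'a small red panda in a zoo',
    'a dog playing in the snow',
    'a plate of pasta with tomato sauce',
]

# Encode each caption as shown earlier and stack the embeddings into one matrix.
text_embeddings = torch.cat([
    model.encode_text(processor.preprocess_text(c), return_features=True)[1]
    for c in captions
])

# L2-normalize so that a dot product equals cosine similarity.
image_query = F.normalize(image_embedding, dim=-1)
text_index = F.normalize(text_embeddings, dim=-1)

# One matrix product ranks the whole collection; higher means more similar.
scores = image_query @ text_index.T
best = scores.argmax(dim=-1).item()
print(captions[best])
```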

### Matching Score

Unlike cosine similarity, unimodal embeddings are not enough: the joint embedding is needed, and the resulting score will lie in the [0, 1] range, with 1 meaning a perfect match.

```python
score = model.get_matching_scores(joint_embedding)
```

Pros:

* The joint embedding captures fine-grained features.
* Suitable for re-ranking, i.e. sorting retrieval results (see the sketch below).

Cons:

* Resource-intensive.
* Not suitable for retrieval in large collections.
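
In practice the two scores are complementary: cosine similarity over unimodal embeddings shortlists candidates cheaply, and the matching score re-orders only that shortlist. Below is a minimal retrieve-then-rerank sketch reusing the calls shown above; the candidate captions and shortlist size are illustrative:

```python
import torch
import torch.nn.functional as F

candidates = [
    'a small red panda in a zoo',
    'a dog playing in the snow',
    'a plate of pasta with tomato sauce',
]
shortlist_size = 2  # illustrative; a real collection would use a larger K

# Stage 1: cheap unimodal retrieval with cosine similarity.
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_inputs = [processor.preprocess_text(c) for c in candidates]
text_outputs = [model.encode_text(t, return_features=True) for t in text_inputs]
text_embeddings = torch.cat([embedding for _, embedding in text_outputs])

coarse = F.normalize(image_embedding, dim=-1) @ F.normalize(text_embeddings, dim=-1).T
shortlist = coarse.squeeze(0).topk(shortlist_size).indices.tolist()

# Stage 2: re-rank only the shortlisted captions with the joint encoder.
reranked = []
for i in shortlist:
    text_features, _ = text_outputs[i]
    joint_embedding = model.encode_multimodal(
        image_features=image_features,
        text_features=text_features,
        attention_mask=text_inputs[i]['attention_mask'],
    )
    score = model.get_matching_scores(joint_embedding)
    reranked.append((score.item(), candidates[i]))

for score, caption in sorted(reranked, reverse=True):
    print(f'{score:.3f}  {caption}')
```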