Transformers
Inference Endpoints
kimihailv's picture
Update README.md
2c15412
metadata
license: apache-2.0
language:
  - en
  - de
  - es
  - fr
  - it
  - ja
  - ko
  - pl
  - ru
  - tr
  - zh
  - ar

UForm

Multi-Modal Inference Library
For Semantic Search Applications


UForm is a Multi-Modal Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

This is model card of the Multilingual model (12 languages) with:

  • 12 layers BERT (8 layers for unimodal encoding and rest layers for multimodal encoding)
  • ViT-B/16 (image resolution is 224x224)

The model was trained on balanced multilingual dataset.

If you need English model, check this.

If you need more languages, check this.

Evaluation

The following metrics were obtained with multimodal re-ranking:

Monolingual

Dataset Recall@1 Recall@5 Recall@10
Zero-Shot Flickr 0.558 0.813 0.874
MS-COCO (train split was in training data) 0.401 0.680 0.781

Multilingual (XTD-10)

Metric is recall@10

English German Spanish French Italian Russian Japanese Korean Turkish Chinese Polish
96.3 92.6 94.5 94.4 94.4 90.4 88.3 92.5 94.4 93.6 95.0

Installation

pip install uform

Usage

To load the model:

import uform

model = uform.get_model('unum-cloud/uform-vl-english')

To encode data:

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

To get features:

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

Cosine Similarity

import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)

The similarity will belong to the [-1, 1] range, 1 meaning the absolute match.

Pros:

  • Computationally cheap.
  • Only unimodal embeddings are required, unimodal encoding is faster than joint encoding.
  • Suitable for retrieval in large collections.

Cons:

  • Takes into account only coarse-grained features.

Matching Score

Unlike cosine similarity, unimodal embedding are not enough. Joint embedding will be needed and the resulting score will belong to the [0, 1] range, 1 meaning the absolute match.

score = model.get_matching_scores(joint_embedding)

Pros:

  • Joint embedding captures fine-grained features.
  • Suitable for re-ranking – sorting retrieval result.

Cons:

  • Resource-intensive.
  • Not suitable for retrieval in large collections.