UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation
In Python, JavaScript, and Swift

The uform3-image-text-multilingual-base UForm model is a tiny vision and multilingual language encoder that covers 21 languages and maps images and text into a shared vector space. This model produces up to 256-dimensional embeddings and consists of:

Text encoder: 12-layer BERT for up to 50 input tokens.
Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares 4 layers between the text and visual encoders to allow for more data- and parameter-efficient training. Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code. If you need a larger, more accurate, or multilingual model, check our HuggingFace Hub. For more details on running the model, check out the UForm GitHub repository.

Evaluation

For all evaluations, the multimodal part was used unless otherwise stated.

Monolingual

Dataset	Recall@1	Recall@5	Recall@10
Zero-Shot Flickr	0.558	0.813	0.874
MS-COCO ¹	0.401	0.680	0.781

¹ Note that the MS-COCO train split was present in the training data.
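
For readers unfamiliar with the metric, Recall@K in the table above is the share of queries whose correct match appears among the top K retrieved items. The snippet below is a minimal sketch of that calculation, not the evaluation script used for these numbers. It assumes a square similarity matrix in which query i matches candidate i; the helper name `recall_at_k` and the toy matrix are illustrative only.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    # Sort candidates for each query by descending similarity and keep the first k.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # Assumption: the correct candidate for query i is candidate i.
    targets = np.arange(similarity.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

# Toy 3x3 text-to-image similarity matrix (made-up values).
toy = np.array([
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.3, 0.2, 0.8],  # correct candidate ranked 3rd
    [0.1, 0.7, 0.6],  # correct candidate ranked 2nd
])

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(toy, k):.3f}")
```

Datasets with several captions per image, such as Flickr and MS-COCO, need a mapping from each caption to its image instead of the diagonal assumption used here.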

Multilingual

Recall@10 on the XTD-10 dataset:

English	German	Spanish	French	Italian	Russian	Japanese	Korean	Turkish	Chinese	Polish
96.1	93.5	95.7	94.1	94.4	90.4	90.2	91.3	95.2	93.8	95.8

Recall@1, Recall@5, and Recall@10 on the COCO-SM dataset:

Target Language	OpenCLIP @ 1	UForm @ 1	OpenCLIP @ 5	UForm @ 5	OpenCLIP @ 10	UForm @ 10	Speakers
Arabic	22.7	31.7	44.9	57.8	55.8	69.2	274 M
Armenian	5.6	22.0	14.3	44.7	20.2	56.0	4 M
Chinese	27.3	32.2	51.3	59.0	62.1	70.5	1'118 M
English	37.8	37.7	63.5	65.0	73.5	75.9	1'452 M
French	31.3	35.4	56.5	62.6	67.4	73.3	274 M
German	31.7	35.1	56.9	62.2	67.4	73.3	134 M
Hebrew	23.7	26.7	46.3	51.8	57.0	63.5	9 M
Hindi	20.7	31.3	42.5	57.9	53.7	69.6	602 M
Indonesian	26.9	30.7	51.4	57.0	62.7	68.6	199 M
Italian	31.3	34.9	56.7	62.1	67.1	73.1	67 M
Japanese	27.4	32.6	51.5	59.2	62.6	70.6	125 M
Korean	24.4	31.5	48.1	57.8	59.2	69.2	81 M
Persian	24.0	28.8	47.0	54.6	57.8	66.2	77 M
Polish	29.2	33.6	53.9	60.1	64.7	71.3	41 M
Portuguese	31.6	32.7	57.1	59.6	67.9	71.0	257 M
Russian	29.9	33.9	54.8	60.9	65.8	72.0	258 M
Spanish	32.6	35.6	58.0	62.8	68.8	73.7	548 M
Thai	21.5	28.7	43.0	54.6	53.7	66.0	61 M
Turkish	25.5	33.0	49.1	59.6	60.3	70.8	88 M
Ukrainian	26.0	30.6	49.9	56.7	60.9	68.1	41 M
Vietnamese	25.4	28.3	49.2	53.9	60.3	65.5	85 M

Mean	26.5±6.4	31.8±3.5	49.8±9.8	58.1±4.5	60.4±10.6	69.4±4.3	-
Google Translate	27.4±6.3	31.5±3.5	51.1±9.5	57.8±4.4	61.7±10.3	69.1±4.3	-
Microsoft Translator	27.2±6.4	31.4±3.6	50.8±9.8	57.7±4.7	61.4±10.6	68.9±4.6	-
Meta NLLB	24.9±6.7	32.4±3.5	47.5±10.3	58.9±4.5	58.2±11.2	70.2±4.3	-

For a deeper comparison of output ranking, the following table reports the Normalized Discounted Cumulative Gain over the first 20 results (NDCG@20):

	Arabic	Armenian	Chinese	French	German	Hebrew	Hindi	Indonesian	Italian	Japanese	Korean	Persian	Polish	Portuguese	Russian	Spanish	Thai	Turkish	Ukrainian	Vietnamese	Mean (all)	Mean (Google Translate)	Mean (Microsoft Translator)	Mean (NLLB)
OpenCLIP NDCG	0.639	0.204	0.731	0.823	0.806	0.657	0.616	0.733	0.811	0.737	0.686	0.667	0.764	0.832	0.777	0.849	0.606	0.701	0.704	0.697	0.716 ± 0.149	0.732 ± 0.145	0.730 ± 0.149	0.686 ± 0.158
UForm NDCG	0.868	0.691	0.880	0.932	0.927	0.791	0.879	0.870	0.930	0.885	0.869	0.831	0.897	0.897	0.906	0.939	0.822	0.898	0.851	0.818	0.875 ± 0.064	0.869 ± 0.063	0.869 ± 0.066	0.888 ± 0.064
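
As a reminder of what NDCG@20 measures, the sketch below computes it for a single ranked result list. It assumes the common formulation with linear gain and a log2 position discount. The source does not state which variant or which relevance labels were used, so treat the function names and the example relevance lists as hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=20):
    """DCG of the given ordering divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each returned item, in the order the model ranked them (made-up values).
perfect_ranking = [3, 2, 1, 0]
swapped_ranking = [0, 1]

print(ndcg_at_k(perfect_ranking))
print(ndcg_at_k(swapped_ranking, k=2))
```

A score of 1.0 means the results are already in the ideal order; lower values mean relevant items were placed further down the list.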

Installation

pip install "uform[torch,onnx]"

Usage

To load the model:

from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-multilingual-base'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

To encode the content:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
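
Because both encoders map into a shared vector space, a natural next step is to compare the two embeddings with cosine similarity. The following is a minimal sketch that assumes the embeddings can be converted to NumPy arrays (the actual return type depends on the backend). The helper `cosine_similarity` is not part of UForm, and the random vectors are placeholders standing in for `image_embedding` and `text_embedding`.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors, in the range [-1, 1]."""
    # Flatten so that inputs shaped (1, dim) and (dim,) are handled alike.
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 256-dimensional vectors; replace with the real model outputs.
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=(1, 256))
text_embedding = rng.normal(size=(1, 256))

score = cosine_similarity(image_embedding, text_embedding)
print(f"similarity: {score:.4f}")
```

Higher scores indicate a closer match between the image and the text. With the PyTorch backend, the tensors may need `.detach().cpu().numpy()` before being passed in.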
