	---
	license: apache-2.0
	pipeline_tag: feature-extraction
	tags:
	- clip
	- vision
	datasets:
	- Ziyang/yfcc15m
	- conceptual_captions
	---
	<h1 align="center">UForm</h1>
	<h3 align="center">
	Pocket-Sized Multimodal AI<br/>
	For Content Understanding and Generation<br/>
	In Python, JavaScript, and Swift<br/>
	</h3>

	---

	The `uform3-image-text-english-base` UForm model is a tiny vision and English-language encoder that maps images and text into a shared vector space.
	This model produces up to __256-dimensional embeddings__ and is made of:

	* Text encoder: 4-layer BERT for up to 64 input tokens.
	* Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

	Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders, which allows for more data- and parameter-efficient training.
	Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
	If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
	For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).
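
The phrase "up to 256-dimensional embeddings" suggests that shorter vectors can also be used. One common way to get them, assumed here and not confirmed by this card, is to keep a prefix of the full embedding and re-normalize it to unit length. The helper name `truncate_and_normalize` and the toy vector are hypothetical; this is a minimal pure-Python sketch, not UForm's own API.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale them to unit L2 norm.

    Assumption: the leading dimensions of the embedding remain useful
    on their own, which this model card does not state explicitly.
    """
    prefix = list(embedding[:dims])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return prefix
    return [x / norm for x in prefix]


# Toy 4-dimensional vector standing in for a real 256-dimensional embedding.
full = [3.0, 4.0, 12.0, 0.0]
short = truncate_and_normalize(full, 2)
```

Whether truncated vectors keep enough retrieval quality for a given task is worth checking against the UForm repository and your own data.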

	## Evaluation

	On text-to-image retrieval, the model reaches 94.9% Recall@10 on zero-shot Flickr:

	| Dataset          | Recall@1 | Recall@5 | Recall@10 |
	| :--------------- | -------: | -------: | --------: |
	| Zero-Shot Flickr |    0.727 |    0.915 |     0.949 |
	| MS-COCO ¹        |    0.510 |    0.761 |     0.838 |

	> ¹ Note that the MS-COCO train split was present in the training data.
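
For readers unfamiliar with the metric, Recall@K is the fraction of queries whose correct match appears among the top K retrieved items. The sketch below is a simplified illustration, not the evaluation script behind the table: it assumes exactly one correct item per query, stored at the same index, whereas Flickr and MS-COCO pair several captions with each image. The function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose matching item ranks in the top k.

    Assumes similarity[i][j] scores query i against item j, and that
    the correct item for query i sits at index i (one match per query).
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank item indices by descending score and keep the best k.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix: queries 0 and 2 rank their match first,
# query 1 ranks its match second.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.6],
]
recall_1 = recall_at_k(toy_scores, k=1)
recall_2 = recall_at_k(toy_scores, k=2)
```

On the toy matrix, two of three queries succeed at K=1 and all three at K=2, which is why Recall@K can only grow as K increases, as in the table above.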

	## Installation

	```bash
	pip install "uform[torch,onnx]"
	```

	## Usage

	To load the model:

	```python
	from uform import get_model, Modality

	import requests
	from io import BytesIO
	from PIL import Image

	model_name = 'unum-cloud/uform3-image-text-english-base'
	modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
	processors, models = get_model(model_name, modalities=modalities)

	model_text = models[Modality.TEXT_ENCODER]
	model_image = models[Modality.IMAGE_ENCODER]
	processor_text = processors[Modality.TEXT_ENCODER]
	processor_image = processors[Modality.IMAGE_ENCODER]
	```

	To encode the content:

	```python
	text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
	image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
	# Download the image and decode it into a PIL image
	image = Image.open(BytesIO(requests.get(image_url).content))

	image_data = processor_image(image)
	text_data = processor_text(text)
	image_features, image_embedding = model_image.encode(image_data, return_features=True)
	text_features, text_embedding = model_text.encode(text_data, return_features=True)
	```
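
Because both encoders map into a shared vector space, a natural next step is to compare the two embeddings with cosine similarity. The sketch below uses short placeholder vectors so it runs without downloading the model; the function name `cosine_similarity` and the placeholder values are hypothetical. With real outputs, `image_embedding` and `text_embedding` would first need to be converted to flat lists, which is an assumption about their type (for example, `.flatten().tolist()` works on both PyTorch tensors and NumPy arrays).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for real model outputs, e.g.
#   image_vector = image_embedding.flatten().tolist()  # assumed conversion
#   text_vector = text_embedding.flatten().tolist()    # assumed conversion
image_vector = [1.0, 0.0, 1.0]
text_vector = [1.0, 1.0, 0.0]
similarity = cosine_similarity(image_vector, text_vector)
```

Scores closer to 1 indicate that the text and the image are closer in the shared space. For large collections, a vectorized library or a vector search engine is a better fit than this per-pair loop.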