---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- clip
- vision
datasets:
- Ziyang/yfcc15m
- conceptual_captions
---

# UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation
In Python, JavaScript, and Swift

---

The `uform3-image-text-english-large` UForm model is a tiny vision and English-language encoder that maps images and text into a shared vector space.

This model produces __64-, 256-, 512-, and 768-dimensional embeddings__ and is made of:

* Text encoder: 12-layer BERT for up to 64 input tokens.
* Visual encoder: ViT-L/14 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares 6 layers between the text and visual encoders, which allows for more data- and parameter-efficient training.
Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

## Evaluation

For zero-shot ImageNet classification the model achieves Top-1 accuracy of 51.8% and Top-5 of 75.6%.
On text-to-image retrieval it reaches 92% Recall@10 for Flickr:

| Dataset | Recall@1 | Recall@5 | Recall@10 |
| :-------- | ------: | --------: | --------: |
| Zero-Shot Flickr | 0.693 | 0.875 | 0.923 |
| Zero-Shot MS-COCO | 0.382 | 0.617 | 0.728 |

## Installation

```bash
pip install "uform[torch,onnx]"
```

## Usage

To load the model:

```python
from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-english-large'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

# Each modality has its own encoder and its own input processor
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
```

To encode the content:

```python
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'

# Download the picture and decode it into a PIL image
image = Image.open(BytesIO(requests.get(image_url).content))

# Preprocess the raw inputs into model-ready tensors
image_data = processor_image(image)
text_data = processor_text(text)

# With return_features=True each encoder returns both features and the final embedding
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
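Once `image_embedding` and `text_embedding` are available, a common next step is to compare them with cosine similarity. The sketch below is a minimal, dependency-free illustration rather than part of the UForm API. The helper names (`l2_normalize`, `truncate_embedding`, `cosine_similarity`) are hypothetical, and the vectors are synthetic stand-ins for real model outputs. It also assumes, without confirmation from this card, that the 64-, 256-, and 512-dimensional embeddings are obtained by keeping the leading components of the 768-dimensional vector and re-normalizing; check the UForm repository before relying on that.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def truncate_embedding(vector, dims):
    """Keep the first `dims` components, then re-normalize.

    ASSUMPTION: shorter embeddings are prefixes of the full vector.
    This is not confirmed by the model card.
    """
    if not 0 < dims <= len(vector):
        raise ValueError(f"dims must be between 1 and {len(vector)}")
    return l2_normalize(vector[:dims])


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-ins for real embeddings: a random 768-d "image" vector
# and a noisy copy of it playing the role of a matching "text" vector.
rng = random.Random(0)
image_vec = [rng.gauss(0.0, 1.0) for _ in range(768)]
text_vec = [x + rng.gauss(0.0, 0.5) for x in image_vec]

for dims in (64, 256, 512, 768):
    score = cosine_similarity(
        truncate_embedding(image_vec, dims),
        truncate_embedding(text_vec, dims),
    )
    print(f"{dims:>3} dims: cosine similarity = {score:.3f}")
```

To try this with real outputs, the embeddings would first need to be converted to flat Python lists. If they are NumPy arrays or PyTorch tensors, `embedding.flatten().tolist()` should do that for a single input.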