Image-to-Text
Transformers
Safetensors
English
vlm
text-generation
image-captioning
visual-question-answering
Inference Endpoints
Edit model card

UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. uform-vl-english visual encoder,
  2. Sheared-LLaMA-1.3B language model tuned on instruction datasets.

The model was pre-trained on: MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA and a few internal datasets.

Usage

pip install uform

The generative model can be used to caption images, summarize their content, or answer questions about them. The exact behavior is controlled by prompts.

from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
     output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,
        pad_token_id=processor.tokenizer.pad_token_id
    )

prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]

Evaluation

For captioning evaluation we measure CLIPScore and RefCLIPScore¹.

Model Size Caption Length CLIPScore RefCLIPScore
llava-hf/llava-1.5-7b-hf 7B Long 0.878 0.529
llava-hf/llava-1.5-7b-hf 7B Short 0.886 0.531
Salesforce/instructblip-vicuna-7b 7B Long 0.902 0.534
Salesforce/instructblip-vicuna-7b 7B Short 0.848 0.523
unum-cloud/uform-gen 1.5B Long 0.847 0.523
unum-cloud/uform-gen 1.5B Short 0.842 0.522

Results for VQAv2 evaluation.

Model Size Accuracy
llava-hf/llava-1.5-7b-hf 7B 78.5
unum-cloud/uform-gen 1.5B 66.5

¹ We used apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.

Speed

On RTX 3090, the following performance is expected on text token generation using float16, equivalent PyTorch settings, and greedy decoding.

Model Size Speed Speedup
llava-hf/llava-1.5-7b-hf 7B ~ 40 tokens/second
Salesforce/instructblip-vicuna-7b 7B ~ 40 tokens/second
unum-cloud/uform-gen 1.5B ~ 140 tokens/second x 3.5
Downloads last month
1,030
Safetensors
Model size
1.49B params
Tensor type
F32
·

Finetuned from

Datasets used to train unum-cloud/uform-gen

Spaces using unum-cloud/uform-gen 3

Collection including unum-cloud/uform-gen