license: apache-2.0

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Install

pip install torch==2.1.0 
pip install timm==0.9.12 
pip install transformers==4.32.1

This project supports both NVIDIA GPUs and Ascend NPUs.
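
Since two accelerator families are supported, it can help to pick the device at runtime rather than hard-coding it. The following is a minimal sketch, not part of this repository: `pick_device` is a hypothetical helper, and it assumes that Ascend support is provided by the separately installed `torch_npu` package, which registers the `torch.npu` namespace when imported.

```python
import importlib.util

import torch


def pick_device() -> str:
    """Return "cuda", "npu", or "cpu" depending on what this machine offers."""
    if torch.cuda.is_available():
        return "cuda"
    # Assumption: Ascend devices are exposed through the torch_npu package,
    # which adds torch.npu once it has been imported.
    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (imported for its side effect)

            if torch.npu.is_available():
                return "npu"
        except Exception:
            # A broken or mismatched torch_npu install falls through to CPU.
            pass
    return "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed to `model.to(device=...)` and `tensor.to(...)` in place of a fixed `'cuda'`.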

Usage

import torch
from PIL import Image
from transformers import CLIPImageProcessor

from unit import UNITModel

model_path = "/path/to/UNIT_600M/"

model = UNITModel.from_pretrained(model_path)

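# Move the model to the GPU in bfloat16 so its weights match the dtype of the input tensor.
model.to(device='cuda', dtype=torch.bfloat16)
model.eval()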

image_processor = CLIPImageProcessor.from_pretrained(model_path)

image = Image.open("test.jpg").convert('RGB')

# Preprocess the image, add a batch dimension, then cast to bfloat16 and move to the GPU.
image_input = image_processor(image)['pixel_values'][0]
image_tensor = torch.tensor(image_input).unsqueeze(0).to(torch.bfloat16).cuda()

# Run inference without tracking gradients.
with torch.no_grad():
    cls_tokens, spatial_tokens = model(image_tensor)
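
To pass the outputs to NumPy-based code, upcast them first: NumPy has no bfloat16 dtype, so calling `.numpy()` directly on a bfloat16 tensor fails. The sketch below assumes that `cls_tokens` and `spatial_tokens` are plain bfloat16 tensors. The shapes of the stand-in tensors are placeholders chosen for illustration, not the model's real output shapes, and `to_numpy` is a hypothetical helper.

```python
import torch


def to_numpy(t: torch.Tensor):
    """Detach, upcast to float32, move to the CPU, and convert to a NumPy array."""
    return t.detach().float().cpu().numpy()


# Stand-ins for the model outputs; the shapes here are assumptions.
cls_tokens = torch.zeros(1, 8, dtype=torch.bfloat16)
spatial_tokens = torch.zeros(1, 4, 8, dtype=torch.bfloat16)

cls_np = to_numpy(cls_tokens)
spatial_np = to_numpy(spatial_tokens)
print(cls_np.shape, cls_np.dtype)          # (1, 8) float32
print(spatial_np.shape, spatial_np.dtype)  # (1, 4, 8) float32
```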