CAIP (Contrastive Action-Image Pre-training)

Project page

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

REPO = "yuvansharma/caip-vitl256"
model = AutoModel.from_pretrained(REPO, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(REPO, trust_remote_code=True)

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, text="pick up the red cup", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.image_pooled    [B, 1024]       text-conditioned pooled image embedding
# out.patch_features  [B, 256, 1024]  patch tokens
# out.text_tokens     [B, 64, 1024]   text token embeddings
# out.text_pooled     [B, 1024]       pooled text embedding
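
The output shapes listed above can be exercised without downloading the model. The following is a minimal sketch that uses random stand-in tensors with the documented shapes. The helper names `pairwise_score` and `patches_to_grid` are hypothetical and not part of the CAIP code. Two assumptions are not confirmed by this card: that the pooled image and text embeddings share a space where cosine similarity is meaningful, and that the 256 patch tokens form a 16 x 16 grid in row-major order.

```python
import math

import torch
import torch.nn.functional as F


def pairwise_score(image_pooled: torch.Tensor, text_pooled: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each image embedding and its paired text embedding.

    Shapes: [B, D] and [B, D] -> [B]. Assumes both embeddings share one space.
    """
    return F.cosine_similarity(image_pooled.float(), text_pooled.float(), dim=-1)


def patches_to_grid(patch_features: torch.Tensor) -> torch.Tensor:
    """Reshape [B, N, D] patch tokens to [B, H, W, D].

    Assumes a square grid stored in row-major order (an assumption, see above).
    """
    b, n, d = patch_features.shape
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError(f"{n} patch tokens do not form a square grid")
    return patch_features.reshape(b, side, side, d)


# Stand-in tensors with the documented shapes (B = 2); replace with the fields of `out`.
image_pooled = torch.randn(2, 1024)
text_pooled = torch.randn(2, 1024)
patch_features = torch.randn(2, 256, 1024)

scores = pairwise_score(image_pooled, text_pooled)
grid = patches_to_grid(patch_features)
print(tuple(scores.shape), tuple(grid.shape))
```

With real outputs, the same helpers would be called as `pairwise_score(out.image_pooled, out.text_pooled)` and `patches_to_grid(out.patch_features)`.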

For a smaller download (~1.75 GB), load the bf16 weights by passing revision="bf16" to from_pretrained.
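
As a sketch of what loading the bf16 weights might look like, the snippet below passes `revision="bf16"` and also sets `torch_dtype=torch.bfloat16`. The second argument is an assumption rather than something this card states: depending on the transformers version, weights may otherwise be upcast to float32 after download, which would save bandwidth but not memory. The helper names `approx_weights_gb` and `load_caip_bf16` are hypothetical, and the size estimate counts only parameter bytes.

```python
REPO = "yuvansharma/caip-vitl256"

# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def approx_weights_gb(n_params: float, dtype: str) -> float:
    """Rough weight size in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def load_caip_bf16(repo: str = REPO):
    """Load the bf16 revision and keep the weights in bfloat16.

    The heavy dependencies are imported here so that the size helper above
    can be used without them. Calling this function downloads the weights.
    """
    import torch
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        repo,
        revision="bf16",             # branch named in this card
        torch_dtype=torch.bfloat16,  # assumption: avoid upcasting to float32
        trust_remote_code=True,
    ).eval()


# About 0.9B parameters, as listed in the model metadata.
print(approx_weights_gb(0.9e9, "bfloat16"))  # roughly 1.8, in line with the ~1.75 GB download
print(approx_weights_gb(0.9e9, "float32"))   # roughly 3.6 for the default F32 weights
```

If the model is kept in bfloat16, floating-point inputs such as pixel values would likely need to be cast to the same dtype before the forward pass.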

Citation

@misc{TODO,
}

Safetensors

Model size: 0.9B params
Tensor type: F32