
license: mit

Conditional ViT - B/16 - Categories

Introduced in Weakly-Supervised Conditional Embedding for Referred Visual Search, Lepage et al. 2023

Paper | Training Data | Training Code | Demo

General Information

Model fine-tuned from CLIP ViT-B/16 on LAION-RVS-Fashion (LRVSF) at 224x224 resolution. The conditioning categories are:

Bags
Feet
Hands
Head
Lower Body
Neck
Outwear
Upper Body
Waist
Whole Body
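
Since the conditioning is passed as a string, a small guard can catch typos before they reach the processor. The sketch below is only an illustration: `CATEGORIES` copies the names listed above, while `check_category` is a hypothetical helper that is not part of the released code, and it assumes the processor expects these exact spellings.

```python
# Category names copied verbatim from the list above.
CATEGORIES = (
    "Bags", "Feet", "Hands", "Head", "Lower Body",
    "Neck", "Outwear", "Upper Body", "Waist", "Whole Body",
)


def check_category(name: str) -> str:
    """Return `name` unchanged if it is a listed category, else raise ValueError.

    Hypothetical helper: assumes an exact, case-sensitive match is required.
    """
    if name not in CATEGORIES:
        raise ValueError(
            f"Unknown category {name!r}; expected one of: {', '.join(CATEGORIES)}"
        )
    return name


cat = check_category("Bags")  # passes through unchanged
```

The validated string can then be used as the `categories` entry given to the processor.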

Research use only.

How to Use

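from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the conditional ViT and its matching processor from the Hub.
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat")

# Fetch an example image and pick one of the conditioning categories listed above.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"

# Preprocess the image/category pair, then compute the conditional embedding.
inputs = processor(images=[img], categories=[cat])
raw_embedding = model(**inputs)
# L2-normalize so that dot products between embeddings equal cosine similarities.
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)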
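
Once embeddings are normalized, a typical next step is to rank a gallery of candidate embeddings by cosine similarity to the query. The following is a minimal sketch of that step, not the authors' retrieval pipeline: `rank_gallery` is a hypothetical helper, and the toy 4-dimensional tensors stand in for real query and gallery embeddings that are assumed to have been computed elsewhere.

```python
import torch
import torch.nn.functional as F


def rank_gallery(query: torch.Tensor, gallery: torch.Tensor, top_k: int = 3):
    """Return (scores, indices) of the gallery rows most similar to the query.

    query has shape (D,) and gallery has shape (N, D). Both are L2-normalized
    here, so the dot product below is the cosine similarity.
    """
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)
    similarities = gallery @ query          # shape (N,)
    k = min(top_k, gallery.shape[0])        # never ask for more rows than exist
    return torch.topk(similarities, k=k)    # sorted from most to least similar


# Toy stand-ins for real embeddings (hypothetical values, for illustration only).
query = torch.tensor([1.0, 0.0, 0.0, 0.0])
gallery = torch.tensor([
    [0.0, 1.0, 0.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0, 0.0],   # 45 degrees from the query
])
scores, indices = rank_gallery(query, gallery, top_k=2)
```

In real use, `query` would be the normalized embedding of an image/category pair and each gallery row the embedding of a candidate image. For large galleries, an approximate nearest-neighbour index would likely replace the dense matrix product.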