---
license: cc-by-nc-4.0
---

# Conditional ViT - B/16 - Text

*Introduced in **Weakly-Supervised Conditional Embedding for Referred Visual Search**, Lepage et al. 2023*

[`Paper`](https://arxiv.org/abs/2306.02928) | [`Training Data`](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion) | [`Training Code`](https://github.com/Simon-Lepage/CondViT-LRVSF) | [`Demo`](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)

## General Info

Model fine-tuned from CLIP ViT-B/16 on [LAION-RVS-Fashion](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion) (LRVSF) at 224×224 resolution. The conditioning text is embedded by a frozen [Sentence T5-XL](https://huggingface.co/sentence-transformers/sentence-t5-xl).
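
The bundled processor takes care of this text preprocessing for you (see *How to Use* below). Purely as an illustration, assuming the standalone `sentence-transformers` package, the frozen encoder computes a sentence embedding along these lines:

```python
from sentence_transformers import SentenceTransformer

# Frozen conditioning encoder, kept fixed while the ViT is fine-tuned.
encoder = SentenceTransformer("sentence-transformers/sentence-t5-xl")

# One 768-d embedding per conditioning query.
condition = encoder.encode(["a brown bag"])  # numpy array, shape (1, 768)
```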

This model is released for research use only.

## How to Use

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the model and its matching processor from the Hub.
# NB: as a custom architecture, this checkpoint may require trust_remote_code=True.
model = AutoModel.from_pretrained("Slep/CondViT-B16-txt")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-txt")

# Query image and conditioning text.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
txt = "a brown bag"

# The processor batches and preprocesses both modalities.
inputs = processor(images=[img], texts=[txt])
raw_embedding = model(**inputs)

# L2-normalize before computing cosine similarities.
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
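
Since the embeddings are L2-normalized, cosine similarity between a conditional query and a gallery of product embeddings reduces to a dot product. As a minimal sketch, continuing from the snippet above (the random `gallery` tensor is a stand-in for your own precomputed index):

```python
# Hypothetical gallery of precomputed, L2-normalized product embeddings.
gallery = torch.nn.functional.normalize(torch.randn(1000, normalized_embedding.shape[-1]), dim=-1)

# Cosine similarity is a dot product on normalized vectors.
scores = normalized_embedding @ gallery.T      # shape: (1, 1000)
top_scores, top_idx = scores.topk(5, dim=-1)   # 5 best matches
```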