---

license: cc-by-nc-4.0
model-index:
- name: CondViT-B16-cat
  results:
  - dataset:
      name: LAION - Referred Visual Search - Fashion
      split: test
      type: Slep/LAION-RVS-Fashion
    metrics:
    - name: R@1 +10K Dist.
      type: recall_at_1|10000
      value: 93.44 ± 0.83
    - name: R@5 +10K Dist.
      type: recall_at_5|10000
      value: 98.07 ± 0.37
    - name: R@10 +10K Dist.
      type: recall_at_10|10000
      value: 98.69 ± 0.38
    - name: R@20 +10K Dist.
      type: recall_at_20|10000
      value: 98.98 ± 0.34
    - name: R@50 +10K Dist.
      type: recall_at_50|10000
      value: 99.55 ± 0.18
    - name: R@1 +100K Dist.
      type: recall_at_1|100000
      value: 85.90 ± 1.37
    - name: R@5 +100K Dist.
      type: recall_at_5|100000
      value: 94.22 ± 0.87
    - name: R@10 +100K Dist.
      type: recall_at_10|100000
      value: 96.04 ± 0.68
    - name: R@20 +100K Dist.
      type: recall_at_20|100000
      value: 97.18 ± 0.56
    - name: R@50 +100K Dist.
      type: recall_at_50|100000
      value: 98.28 ± 0.34
    - name: R@1 +500K Dist.
      type: recall_at_1|500000
      value: 78.19 ± 1.59
    - name: R@5 +500K Dist.
      type: recall_at_5|500000
      value: 88.70 ± 1.15
    - name: R@10 +500K Dist.
      type: recall_at_10|500000
      value: 91.46 ± 1.02
    - name: R@20 +500K Dist.
      type: recall_at_20|500000
      value: 94.07 ± 0.86
    - name: R@50 +500K Dist.
      type: recall_at_50|500000
      value: 96.11 ± 0.64
    - name: R@1 +1M Dist.
      type: recall_at_1|1000000
      value: 74.49 ± 1.23
    - name: R@5 +1M Dist.
      type: recall_at_5|1000000
      value: 85.38 ± 1.29
    - name: R@10 +1M Dist.
      type: recall_at_10|1000000
      value: 88.95 ± 1.15
    - name: R@20 +1M Dist.
      type: recall_at_20|1000000
      value: 91.35 ± 0.93
    - name: R@50 +1M Dist.
      type: recall_at_50|1000000
      value: 94.75 ± 0.75
    - name: Available Dists.
      type: n_dists
      value: 2000014
    - name: Embedding Dimension
      type: embedding_dim
      value: 512
    - name: Conditioning
      type: conditioning
      value: category
    source:
      name: LRVSF Leaderboard
      url: https://huggingface.co/spaces/Slep/LRVSF-Leaderboard
    task:
      type: Retrieval
tags:
- lrvsf-benchmark
datasets:
- Slep/LAION-RVS-Fashion
---


# Conditional ViT - B/16 - Categories

*Introduced in <a href="https://arxiv.org/abs/2306.02928">**LRVS-Fashion: Extending Visual Search with Referring Instructions**</a>, Lepage et al., 2023*

<div align="center">
<div id=links>

|Data|Code|Models|Spaces|
|:-:|:-:|:-:|:-:|
|[Full Dataset](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion)|[Training Code](https://github.com/Simon-Lepage/CondViT-LRVSF)|[Categorical Model](https://huggingface.co/Slep/CondViT-B16-cat)|[LRVS-F Leaderboard](https://huggingface.co/spaces/Slep/LRVSF-Leaderboard)|
|[Test set](https://zenodo.org/doi/10.5281/zenodo.11189942)|[Benchmark Code](https://github.com/Simon-Lepage/LRVSF-Benchmark)|[Textual Model](https://huggingface.co/Slep/CondViT-B16-txt)|[Demo](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)|
</div>
</div>

## General Information

This model is fine-tuned from CLIP ViT-B/16 on LRVS-F at 224×224 resolution. The conditioning categories are the following:
- Bags
- Feet
- Hands
- Head
- Lower Body
- Neck
- Outwear
- Upper Body
- Waist
- Whole Body

Research use only.
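
The category strings above are passed as plain strings to the processor (as in the example below). As a small convenience sketch (the constant and helper here are illustrative, not part of the released code), you can keep them in one place and fail early on typos:

```python
# Illustrative only: the ten conditioning categories listed above.
CONDITIONING_CATEGORIES = {
    "Bags", "Feet", "Hands", "Head", "Lower Body",
    "Neck", "Outwear", "Upper Body", "Waist", "Whole Body",
}

def check_category(cat: str) -> str:
    """Raise early if a conditioning category is misspelled."""
    if cat not in CONDITIONING_CATEGORIES:
        raise ValueError(f"Unknown conditioning category: {cat!r}")
    return cat
```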

## How to Use 

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the conditional ViT and its paired processor from the Hub
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat")

# Example query image from the LAION-RVS-Fashion dataset
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"  # conditioning category, see the list above

# Preprocess the image together with its conditioning category
inputs = processor(images=[img], categories=[cat])

# The model returns a raw embedding; L2-normalize it before computing similarities
raw_embedding = model(**inputs)
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
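
Once the embedding is L2-normalized, retrieval amounts to a nearest-neighbor search. The sketch below assumes you have already embedded a gallery of product images the same way and stacked the normalized vectors into `gallery_embeddings` (a random placeholder here); cosine similarity then reduces to a dot product:

```python
# Continuation of the example above. `gallery_embeddings` stands in for
# real (N, 512) L2-normalized product embeddings produced with this model.
gallery_embeddings = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)

# Cosine similarity between unit-norm vectors is a plain dot product.
scores = normalized_embedding @ gallery_embeddings.T   # shape (1, 1000)
top_scores, top_indices = scores.topk(k=5, dim=-1)     # 5 most similar gallery items
print(top_indices)
```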