Update README.md
tags:
- contrastive learning
---

**FLAIR Model**

Authors: [Rui Xiao](https://www.eml-munich.de/people/rui-xiao), [Sanghwan Kim](https://kim-sanghwan.github.io/), [Mariana-Iuliana Georgescu](https://lilygeorgescu.github.io/), [Zeynep Akata](https://www.eml-munich.de/people/zeynep-akata), [Stephan Alaniz](https://www.eml-munich.de/people/stephan-alaniz)

FLAIR was introduced in the paper [FLAIR: VLM with Fine-grained Language-informed Image Representations](https://arxiv.org/abs/2412.03561). Built on the ViT-B-16 model from [OpenCLIP](https://github.com/mlfoundations/open_clip), FLAIR features text-conditioned attention pooling at the end of its vision transformer. Pre-trained on MLLM-recaptioned datasets from [DreamLIP](https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions), FLAIR achieves strong performance on tasks such as zero-shot image-text retrieval and zero-shot segmentation.
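
As a rough intuition for the text-conditioned attention pooling mentioned above, the sketch below shows one minimal way such a pooling layer could be written in PyTorch, with the caption embedding acting as the attention query over the ViT patch tokens. This is an illustrative assumption rather than FLAIR's actual implementation; the class name, head count, and tensor shapes are made up for the example, and the real code lives in the [Github repo](https://github.com/ExplainableML/flair).

```python
# Illustrative sketch only. Not FLAIR's actual code; names and shapes are assumptions.
import torch
import torch.nn as nn

class TextConditionedAttentionPool(nn.Module):
    """Pools ViT patch tokens into one image embedding, conditioned on a text embedding."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim); text_embedding: (batch, dim)
        query = text_embedding.unsqueeze(1)          # the caption acts as the attention query
        pooled, _ = self.attn(query, patch_tokens, patch_tokens)
        return pooled.squeeze(1)                     # (batch, dim) text-conditioned image embedding

# Toy usage with random features: 14x14 = 196 patch tokens, assumed embedding dim 512.
pool = TextConditionedAttentionPool(dim=512)
patches = torch.randn(1, 196, 512)
caption = torch.randn(1, 512)
print(pool(patches, caption).shape)  # torch.Size([1, 512])
```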

**Usage**

We provide detailed usage instructions in our [Github repo](https://github.com/ExplainableML/flair). Example usage:

```python
import flair
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Download the pre-trained weights from the Hugging Face Hub and build the model with its preprocessing transforms.
pretrained = flair.download_weights_from_hf(model_repo='xiaorui638/flair', filename='flair-cc3m-recap.pt')
model, _, preprocess = flair.create_model_and_transforms('ViT-B-16-FLAIR', pretrained=pretrained)

model.to(device)
model.eval()

tokenizer = flair.get_tokenizer('ViT-B-16-FLAIR')

image = preprocess(Image.open("../assets/puppy.jpg")).unsqueeze(0).to(device)

text = tokenizer(["In the image, a small white puppy with black ears and eyes is the main subject",  # ground-truth caption
                  "The white door behind the puppy is closed, and there's a window on the right side of the door",  # ground-truth caption
                  "A red ladybug is surrounded by green glass beads",  # non-ground-truth caption
                  "Dominating the scene is a white desk, positioned against a white brick wall"]).to(device)  # non-ground-truth caption

with torch.no_grad(), torch.cuda.amp.autocast():
    # FLAIR's text-conditioned logits and, for comparison, CLIP-style logits from the same model.
    flair_logits = model.get_logits(image=image, text=text)
    clip_logits = model.get_logits_as_clip(image=image, text=text)

print("logits computed using FLAIR's method:", flair_logits)  # [4.4062, 6.9531, -20.5000, -18.1719]
print("logits computed using CLIP's method:", clip_logits)    # [12.4609, 15.6797, -3.8535, -0.2281]
```
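
To turn these logits into a zero-shot retrieval ranking, a common follow-up is a softmax over the candidate captions. The snippet below is a small usage sketch that builds on the variables from the example above and assumes `flair_logits` has shape `(1, num_captions)`:

```python
import torch.nn.functional as F

# Convert the FLAIR logits into a probability distribution over the candidate captions.
probs = F.softmax(flair_logits, dim=-1)   # shape: (1, num_captions)
best = probs.argmax(dim=-1).item()
print(f"Best-matching caption index: {best} (p={probs[0, best].item():.3f})")
```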