xiaorui638 committed
Commit 607ffa0 · verified · 1 Parent(s): 50f4489

Update README.md

Files changed (1): README.md (+37, -1)
README.md CHANGED
tags:
- contrastive learning
---
**FLAIR Model**

Authors: [Rui Xiao](https://www.eml-munich.de/people/rui-xiao), [Sanghwan Kim](https://kim-sanghwan.github.io/), [Mariana-Iuliana Georgescu](https://lilygeorgescu.github.io/), [Zeynep Akata](https://www.eml-munich.de/people/zeynep-akata), [Stephan Alaniz](https://www.eml-munich.de/people/stephan-alaniz)

FLAIR was introduced in the paper [FLAIR: VLM with Fine-grained Language-informed Image Representations](https://arxiv.org/abs/2412.03561). Built on the ViT-B-16 model from [OpenCLIP](https://github.com/mlfoundations/open_clip), FLAIR adds text-conditioned attention pooling at the end of its vision transformer. Pre-trained on MLLM-recaptioned datasets from [DreamLIP](https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions), FLAIR achieves strong performance on tasks such as zero-shot image-text retrieval and zero-shot segmentation.
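
To illustrate the idea behind text-conditioned attention pooling, here is a minimal, self-contained sketch (an illustration only, not FLAIR's actual module; the class name, dimensions, and pairing are assumptions made for clarity): each caption's text embedding serves as the attention query that pools the ViT patch tokens into a caption-specific image representation.

```python
import torch
import torch.nn as nn

# Minimal sketch of text-conditioned attention pooling (illustrative only,
# not FLAIR's actual implementation). The text embedding acts as the query;
# the ViT patch tokens are the keys and values, so each caption pools its
# own view of the image.
class TextConditionedAttnPool(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, dim) global caption embeddings
        # patch_tokens: (batch, num_patches, dim) ViT output tokens
        query = text_emb.unsqueeze(1)                       # (batch, 1, dim)
        pooled, _ = self.attn(query, patch_tokens, patch_tokens)
        return pooled.squeeze(1)                            # (batch, dim)

pool = TextConditionedAttnPool()
text_emb = torch.randn(4, 512)            # one embedding per caption
patch_tokens = torch.randn(4, 196, 512)   # 14x14 patches for a ViT-B/16 at 224px
image_emb = pool(text_emb, patch_tokens)  # caption-specific image features
print(image_emb.shape)                    # torch.Size([4, 512])
```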

**Usage**

We provide detailed usage instructions in our [GitHub repo](https://github.com/ExplainableML/flair). Example usage:

```python
import flair
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Download the pre-trained weights from the Hugging Face Hub and build the model
pretrained = flair.download_weights_from_hf(model_repo='xiaorui638/flair', filename='flair-cc3m-recap.pt')
model, _, preprocess = flair.create_model_and_transforms('ViT-B-16-FLAIR', pretrained=pretrained)

model.to(device)
model.eval()

tokenizer = flair.get_tokenizer('ViT-B-16-FLAIR')

image = preprocess(Image.open("../assets/puppy.jpg")).unsqueeze(0).to(device)

text = tokenizer(["In the image, a small white puppy with black ears and eyes is the main subject",  # ground-truth caption
                  "The white door behind the puppy is closed, and there's a window on the right side of the door",  # ground-truth caption
                  "A red ladybug is surrounded by green glass beads",  # non-ground-truth caption
                  "Dominating the scene is a white desk, positioned against a white brick wall"]).to(device)  # non-ground-truth caption

with torch.no_grad(), torch.cuda.amp.autocast():
    flair_logits = model.get_logits(image=image, text=text)         # text-conditioned (FLAIR) scoring
    clip_logits = model.get_logits_as_clip(image=image, text=text)  # standard CLIP-style scoring with global features

print("logits computed FLAIR's way:", flair_logits)  # [4.4062, 6.9531, -20.5000, -18.1719]
print("logits computed CLIP's way:", clip_logits)    # [12.4609, 15.6797, -3.8535, -0.2281]
```
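
Both scoring paths rank the two ground-truth captions well above the mismatched ones. If per-caption matching probabilities are needed, a softmax over the candidates is a standard, model-agnostic post-processing step (a generic sketch, assuming `flair_logits` holds one score per candidate caption):

```python
# Generic post-processing, not FLAIR-specific: normalize the per-caption
# scores into probabilities over the four candidate captions.
probs = flair_logits.softmax(dim=-1)
print(probs)  # the ground-truth captions take nearly all of the probability mass
```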