yejunyoon committed
Commit e856cef
1 parent: f7100ec

Update README.md

Files changed (1): README.md (+62, -25)
README.md CHANGED
@@ -1,53 +1,90 @@
- # Model Details
- The CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to understand news thumbnail representativeness by counterfactual text-guided contrastive language-image pretraining.

  # Model Date
  January 2024

  # Model Type
- The model uses a ViT-L/14 transformer architecture as an image encoder and a causal text transformer as a text encoder. These encoders initialized weight for openai/clip-vit-large-patch14 before training. It is trained that the similarity of positive (image, text) pairs is high, and the similarity of in-batch negatives and hard negatives is low via contrastive loss.

  Input: image and text

  output: image and text representation


- # Intended Use
  The model is intended as a research output for research communities.

- # Primary intended uses
  The primary intended users of these models are AI researchers.

- # Out-of-Scope Use Cases
  The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.

- # Factors
- # Environment
- This model was trained on a machine equipped with AMD Ryzen Threadripper Pro 5975WX CPU, three Nvidia RTX A6000 GPUs (48GB per GPU), and 256GB RAM. The experiments were conducted on Python 3.9, Pytorch 1.10.1, Transformers 4.29.2, LAVIS 1.0.2, and SentenceTransformer 2.2.2. Five random seeds were used for repeated experiments: 0, 1, 2, 3, and 4. The temperature used for adjusting the masked token prediction is set as 2.0.

- # Card Prompts
- # Relevant factors
- We trained the models with the AdamW optimizer with the initial learning rate of 1e-4, updated by the cosine annealing scheduler.The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set. Model training was early-stopped when the validation loss was not decreased five times consecutively, measured for every 20 iterations.

- # Evaluation factors

- # Metrics
  Model performance measures: F1-score between model predictions and labels and Spearman between cosine similarity of models between labels.

- Decision thresholds: validation

  Approaches to uncertainty and variability: Measure by changing the random seed 5 times


- # Data
- # Training Data
- The model was trained using the summary text and thumbnail image for the image in the first paragraph of the publicly available BBC English Dataset.
- The original implementation had two variants: one using a NELA-GT-2021 and the other using the titles instead of summary text from BBC Dataset.

- # Evaluation Data
  In NELA-GT-2021, annotation was performed by randomly sampling 1,000 in 10,000 samples not included in the train and valid set.
- Ethical Considerations
- Because CLIP's weights are used, potential bias in CLIP and potential bias in the data used for learning may also be included.
-
-
- # Caveats and Recommendations
+ ## Model Details
+ CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness via counterfactual text-guided contrastive language-image pretraining.

  # Model Date
  January 2024

  # Model Type
+ The model uses a ViT-L/14 transformer architecture as an image encoder and a causal text transformer as a text encoder.
+ The encoders were initialized with the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training.
+ The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high while the similarity of in-batch negatives and hard negatives is low.

  Input: image and text

  output: image and text representation
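
+ As a rough illustration of this objective, the following is a minimal InfoNCE-style sketch (not the released training code; variable names, the hard-negative layout, and the helper function are illustrative assumptions) using the temperature reported under Relevant factors below:
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def contrastive_loss(image_embeds, text_embeds, hard_negative_embeds, tau=0.05):
+     # Normalize so that dot products are cosine similarities.
+     img = F.normalize(image_embeds, dim=-1)          # (B, D)
+     txt = F.normalize(text_embeds, dim=-1)           # (B, D)
+     neg = F.normalize(hard_negative_embeds, dim=-1)  # (B, D), one hard-negative text per image
+
+     # Each image is scored against every in-batch text plus its own hard negative.
+     sim_in_batch = img @ txt.t()                           # (B, B); diagonal entries are positives
+     sim_hard_neg = (img * neg).sum(dim=-1, keepdim=True)   # (B, 1)
+     logits = torch.cat([sim_in_batch, sim_hard_neg], dim=1) / tau
+
+     # The positive for image i is text i, i.e., column i of the logits.
+     targets = torch.arange(img.size(0), device=img.device)
+     return F.cross_entropy(logits, targets)
+ ```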


+ ## Uses
+
+ ### Use with Transformers
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoProcessor
+
+ # Load the CFT-CLIP weights and processor from the Hugging Face Hub.
+ processor = AutoProcessor.from_pretrained("humane-lab/cft-clip")
+ model = AutoModel.from_pretrained("humane-lab/cft-clip")
+
+ # Prepare an (image, text) pair.
+ image = Image.open("cat.jpg")
+ inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")
+
+ # A forward pass returns the projected text and image representations.
+ outputs = model(**inputs)
+ text_embeds = outputs.text_embeds
+ image_embeds = outputs.image_embeds
+ ```
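+
+ A (thumbnail, text) pair can then be scored by the cosine similarity of the two embeddings; a minimal sketch (the decision threshold itself is tuned on validation data, as described under Metrics below):
+ ```python
+ # Cosine similarity between the projected text and image embeddings.
+ score = torch.nn.functional.cosine_similarity(text_embeds, image_embeds).item()
+ print(f"cosine similarity: {score:.3f}")
+ ```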
+
+ ### Intended Use
  The model is intended as a research output for research communities.

+ ### Primary intended uses
  The primary intended users of these models are AI researchers.

+ ### Out-of-Scope Use Cases
  The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.


+ ## Factors
+ ### Relevant factors
+ We trained the models with the AdamW optimizer with an initial learning rate of 1e-4, updated by a cosine annealing scheduler.
+ The minibatch size is 128, and the temperature τ in the loss is 0.05. Other hyperparameters were optimized by random search using a validation set.
+ Model training was early-stopped when the validation loss did not decrease for five consecutive checks, measured every 20 iterations.
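+
+ A minimal sketch of this optimization setup in PyTorch is shown below. It is illustrative only: `model`, `train_loader`, `valid_loader`, and `num_iterations` are assumed to exist, and `training_step` / `validation_loss` are hypothetical helpers returning the contrastive loss.
+ ```python
+ import torch
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_iterations)
+
+ best_val, bad_checks, patience = float("inf"), 0, 5
+ for step, batch in enumerate(train_loader, start=1):  # minibatch size 128
+     loss = training_step(model, batch)  # contrastive loss with temperature 0.05
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+     scheduler.step()
+
+     if step % 20 == 0:  # check the validation loss every 20 iterations
+         val = validation_loss(model, valid_loader)
+         bad_checks = 0 if val < best_val else bad_checks + 1
+         best_val = min(best_val, val)
+         if bad_checks >= patience:  # early stopping after five non-improving checks
+             break
+ ```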

+ ### Evaluation factors
+ We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24), with the decision threshold tuned on the validation set.

+ ## Metrics
  Model performance measures: F1-score between model predictions and labels, and the Spearman correlation between the model's cosine similarities and the labels.

+ Decision thresholds: a cosine-similarity threshold selected on the validation set.

  Approaches to uncertainty and variability: measured by repeating the experiment with 5 different random seeds
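+
+ As a rough sketch, both metrics can be computed from per-example cosine similarities as follows (the arrays, the threshold value, and the use of scikit-learn/SciPy are illustrative assumptions):
+ ```python
+ import numpy as np
+ from scipy.stats import spearmanr
+ from sklearn.metrics import f1_score
+
+ # Cosine similarities between thumbnail and text embeddings, and binary labels.
+ similarities = np.array([0.31, 0.12, 0.27, 0.05])
+ labels = np.array([1, 0, 1, 0])
+
+ threshold = 0.2  # in practice, selected on the validation set
+ predictions = (similarities >= threshold).astype(int)
+
+ rho, _ = spearmanr(similarities, labels)
+ print("F1:", f1_score(labels, predictions))
+ print("Spearman:", rho)
+ ```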


+ ## Data
+ ### Training Data
+ The model was trained on summary text paired with the thumbnail image (the image in the first paragraph of each article) from the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/).
+ The original implementation had two variants: one trained on [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other using article titles instead of summary text from the BBC dataset.

+ ### Evaluation Data
  In NELA-GT-2021, annotation was performed on 1,000 samples randomly drawn from 10,000 samples not included in the train and validation sets.
+ For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).
+
+ ## Evaluation
+ We measured the ability of pretrained vision-language models to assess news thumbnail representativeness. In addition to CLIP, we used BLIP and BLIP-2; BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.
+
+ |Model|F1|Spearman|
+ |---|---|---|
+ |CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
+ |CLIPAdapt|0.767±0.006|0.459±0.004|
+ |CLIP|0.763|0.409|
+ |BLIP|0.737|0.408|
+ |BLIP-2|0.707|0.415|
+ |BLIP-2+SBERT|0.694|0.341|
+
+ ## Ethical Considerations
+ For pretraining, this study used publicly available news articles shared by news media.
+ While we tried to build a high-quality pretraining corpus, it is possible that the model learned hidden biases present in online news.
+ Also, since CFT-CLIP was initialized from pretrained CLIP weights, it may inherit the biases of CLIP.
+ Users should be cautious about applying the method to problems in a general context and be aware of potential bias.