---
language:
- en
---

## Model Details

CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness via counterfactual text-guided contrastive language-image pretraining.

### Model Date

January 2024

### Model Type

The model uses a ViT-L/14 transformer as the image encoder and a causal text transformer as the text encoder. Both encoders were initialized with the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training. The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high while the similarity of in-batch negatives and hard negatives is low.

Input: image and text. Output: image and text representations.

## Uses

### Use with Transformers

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("humane-lab/cft-clip")
model = AutoModel.from_pretrained("humane-lab/cft-clip")

image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

outputs = model(**inputs)
# Projected representations for the text and the image
text_embeds = outputs.text_embeds
image_embeds = outputs.image_embeds
```

### Intended Use

The model is intended as a research output for research communities.

### Primary intended uses

The primary intended users of these models are AI researchers.

### Out-of-Scope Use Cases

The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.

## Factors

### Relevant factors

We trained the models with the AdamW optimizer with an initial learning rate of 1e-4, decayed by a cosine annealing scheduler. The minibatch size is 128. The temperature τ in the contrastive loss is 0.05. Other hyperparameters were optimized by random search using a validation set. Training was early-stopped when the validation loss, measured every 20 iterations, did not decrease for five consecutive measurements.

### Evaluation factors

We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24), with the decision threshold optimized on the validation set.

## Metrics

- Model performance measures: F1 score between model predictions and labels, and Spearman correlation between the models' cosine similarity scores and the labels.
- Decision thresholds: selected based on cosine similarity on the validation set.
- Approaches to uncertainty and variability: measured by repeating the experiment with five different random seeds.

## Data

### Training Data

The model was trained on the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/), using the summary text paired with the image in the first paragraph as the thumbnail image. The original implementation has two variants: one using [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other using titles instead of the summary text from the BBC dataset.

### Evaluation Data

For NELA-GT-2021, annotation was performed on 1,000 samples randomly drawn from 10,000 samples not included in the training and validation sets. For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).

## Evaluation

We measured the ability of pretrained vision-language models on this task. In addition to CLIP, we used BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.
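The reported F1 scores come from thresholding the cosine similarity between the text and image embeddings, with the threshold chosen on the validation set. Below is a minimal sketch of such a threshold-based prediction; the `is_representative` helper and the `THRESHOLD` value are illustrative assumptions, not part of the released code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("humane-lab/cft-clip")
model = AutoModel.from_pretrained("humane-lab/cft-clip")

# Hypothetical decision threshold; in practice it is tuned on the validation set.
THRESHOLD = 0.3


def is_representative(image_path: str, text: str, threshold: float = THRESHOLD) -> bool:
    """Predict whether a thumbnail represents the text by thresholding cosine similarity."""
    image = Image.open(image_path)
    inputs = processor(text=[text], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the projected text and image embeddings
    similarity = F.cosine_similarity(outputs.text_embeds, outputs.image_embeds).item()
    return similarity >= threshold
```

Results for CFT-CLIP and the baselines are summarized in the table below.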
|Model|F1|Spearman|
|---|---|---|
|CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
|CLIPAdapt|0.767±0.006|0.459±0.004|
|CLIP|0.763|0.409|
|BLIP|0.737|0.408|
|BLIP-2|0.707|0.415|
|BLIP-2+SBERT|0.694|0.341|

## Ethical Considerations

For pretraining, this study used publicly available news articles shared by news media. While we tried to build a high-quality corpus for pretraining, it is possible that the model learned hidden biases present in online news. Also, since CFT-CLIP was initialized from the pretrained CLIP weights, it may inherit the biases of CLIP. Users should be cautious about applying the method to problems in a general context and be aware of potential bias.