---
library_name: transformers
tags:
- art
datasets:
- ColumbiaNLP/V-FLUTE
language:
- en
metrics:
- f1
---
# Model Card for LLaVA-1.5-7B Fine-Tuned on V-FLUTE + e-ViL
This is the checkpoint for the model from the paper [V-FLUTE: Visual Figurative Language Understanding with Textual Explanations](https://arxiv.org/abs/2405.01474).
Specifically, it is the best-performing model fine-tuned on a combination of the V-FLUTE and e-ViL (e-SNLI-VE) datasets, with early stopping based on the V-FLUTE validation set.
## Model Details
### Model Description
- **LLaVA 1.5:** https://github.com/haotian-liu/LLaVA
- **V-FLUTE dataset:** https://huggingface.co/datasets/ColumbiaNLP/V-FLUTE
- **V-FLUTE paper:** https://arxiv.org/abs/2405.01474
Citation:
```
@misc{saakyan2024vflute,
      title={V-FLUTE: Visual Figurative Language Understanding with Textual Explanations},
      author={Arkadiy Saakyan and Shreyas Kulkarni and Tuhin Chakrabarty and Smaranda Muresan},
      year={2024},
      eprint={2405.01474},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- **Developed by:** Arkadiy Saakyan (ColumbiaNLP)
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** LLaVA-v1.5
### Model Sources
- **Repository:** https://github.com/asaakyan/V-FLUTE
- **Paper:** https://arxiv.org/abs/2405.01474
## Uses
The model's intended use is limited to interpreting multimodal figurative inputs such as metaphors, similes, idioms, sarcasm, and humor.
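For illustration, queries can follow the claim/label format used in the inference example later in this card. The sketch below is a minimal, hypothetical helper (`build_vflute_prompt` is not part of the released code) showing one way to construct such a prompt:
```python
def build_vflute_prompt(claim: str) -> str:
    """Build a V-FLUTE-style query asking whether an image entails or contradicts a claim."""
    return (
        f'Does the illustration affirm or contest the claim "{claim}"? '
        "Provide your argument and choose a label: entailment or contradiction."
    )

print(build_vflute_prompt("Feeling motivated and energetic after only cleaning a room minimally."))
```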
### Out-of-Scope Use
The model may not work well for general instruction-following use cases beyond figurative language understanding.
## Bias, Risks, and Limitations
The V-FLUTE dataset or its source datasets may contain biases, especially the subsets reflecting user-generated content distributions (MemeCap and MuSE).
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
## How to Get Started with the Model
Install LLaVA as described here: https://github.com/asaakyan/LLaVA/tree/6f595efcf2699884f18957ee603986cebfaa9df7
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava_mod import eval_model

# Base model name and path to this LoRA checkpoint (adjust to your local setup).
model_base = "llava-v1.5-7b"
model_path = "llava-v1.5-7b-evil-vflue-v2-lora"
model_name = get_model_name_from_path(model_path)

# Load the LoRA weights on top of the base model.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=model_name,
    load_4bit=False
)

prompt = """Does the illustration affirm or contest the claim "Feeling motivated and energetic after only cleaning a room minimally."? Provide your argument and choose a label: entailment or contradiction."""
image_path = "path/to/images"  # placeholder: directory containing the query image
image_file = f"{image_path}/27.png"

# Pack the inference arguments into the simple namespace expected by eval_model.
infer_args = type('Args', (), {
    "model_name": model_name,
    "model": model,
    "tokenizer": tokenizer,
    "image_processor": image_processor,
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 3,
    "max_new_tokens": 512
})()

output = eval_model(infer_args)
print(output)
```
## Training Details
See the fine-tuning scripts [here](https://github.com/asaakyan/LLaVA/tree/6f595efcf2699884f18957ee603986cebfaa9df7/scripts/vflute) or the V-FLUTE repository [here](https://github.com/asaakyan/V-FLUTE).
### Training Data
https://huggingface.co/datasets/ColumbiaNLP/V-FLUTE
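The dataset can be loaded directly from the Hub with 🤗 `datasets`; a minimal sketch (the split and field names are not listed in this card and may differ):
```python
from datasets import load_dataset

# Load V-FLUTE from the Hugging Face Hub.
vflute = load_dataset("ColumbiaNLP/V-FLUTE")
print(vflute)  # inspect the available splits and their fields
```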
## Model Card Contact
a.saakyan@cs.columbia.edu