LLaVA-Next-OneVision-0.5B-levircc
CAUTION: This model only performs well on remote sensing (RS) change captioning.
The model takes two remote sensing images of the same area and describes the changes between them. On other tasks it will not perform well.
Model Overview
- Name: LLaVA-Next-OneVision-0.5B-levircc
- Model Type: Vision-Language Foundation Model
- Parameters: ~0.9B total (0.5B language model plus the vision encoder)
- Purpose: Change Captioning
- Fine-tuned On: LEVIR-CC dataset
This model was fine-tuned on the LEVIR-CC dataset, starting from the pretrained weights of the llava-next-onevision-qwen2-0.5b-ov model.
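For context, LLaVA-NeXT fine-tuning consumes training samples in a conversation-style JSON format. Below is a hypothetical sketch of how a single LEVIR-CC before/after pair could be expressed in that format; the file names, folder layout, and instruction wording are illustrative placeholders, not the exact recipe used to train this checkpoint.

import json

# Hypothetical conversion of one LEVIR-CC sample into the LLaVA conversation
# format. Paths and instruction text are placeholders, not the actual
# training configuration of this model.
sample = {
    "id": "levircc_train_000001",
    "image": ["A/train_000001.png", "B/train_000001.png"],  # before/after pair
    "conversations": [
        {"from": "human",
         "value": "<image>\n<image>\nDescribe the changes between the two images."},
        {"from": "gpt",
         "value": "a winding road is built and many houses are constructed beside it ."},
    ],
}

with open("levircc_llava_train.json", "w") as f:
    json.dump([sample], f, indent=2)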
Installation
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install --upgrade pip
pip install -e ".[train]"
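After installation, a quick import check confirms the package is visible to Python (run it once the editable install succeeds):

# Sanity check: the editable install should make the llava package importable.
import llava
from llava.model.builder import load_pretrained_model  # entry point used in the Usage section
print("LLaVA-NeXT import OK")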
Usage
Below is an example of how to use this model:
import sys
import os
# Make the llava package importable when this script runs outside the LLaVA-NeXT folder
scriptpath = "path to your LLaVA-NeXT folder"
sys.path.append(os.path.abspath(scriptpath))
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import copy
import torch
def get_device_map() -> str:
    return 'cuda' if torch.cuda.is_available() else 'cpu'
device = get_device_map()
pretrained = "turabimi4/llava-next-onevision-0.5b-levircc"
model_name = "llava_qwen"
device_map = "auto"
llava_model_args = {
    "multimodal": True,
    "attn_implementation": "sdpa",
}
overwrite_config = {}
overwrite_config["image_aspect_ratio"] = "pad"
llava_model_args["overwrite_config"] = overwrite_config
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)
model.to(device)
model.eval()
def generate_answer_from_images(image_path1, image_path2, question, model, tokenizer, image_processor, conv_template, device):
    try:
        # Load and preprocess the images
        image1 = Image.open(image_path1).convert('RGB')
        image2 = Image.open(image_path2).convert('RGB')
    except FileNotFoundError as e:
        print(f"Image not found: {e}")
        return None

    images = [image1, image2]

    # Process each image separately and collect the tensors
    image_tensors = []
    image_sizes = []
    for image in images:
        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
        # Move to the selected device instead of hard-coding CUDA
        image_tensors.append(image_tensor.half().to(device))
        image_sizes.append(image.size)

    # Add one <image> placeholder; the loop below expands it so one
    # DEFAULT_IMAGE_TOKEN lands before the question and one after it,
    # matching the two input images.
    question = f"<image> {question}"
    question_parts = question.split('<image>')
    formatted_question = ''
    for i, part in enumerate(question_parts):
        formatted_question += part
        if i < 2:  # insert DEFAULT_IMAGE_TOKEN for each of the two images
            formatted_question += DEFAULT_IMAGE_TOKEN

    # Prepare the conversation
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], formatted_question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()

    # Tokenize the prompt, mapping each image placeholder to IMAGE_TOKEN_INDEX
    input_ids = tokenizer_image_token(
        prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

    # Generate the answer (greedy decoding)
    with torch.no_grad():
        cont = model.generate(
            input_ids,
            images=image_tensors,
            image_sizes=image_sizes,
            do_sample=False,
            temperature=0,
            max_new_tokens=64
        )
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
    generated_answer = text_outputs[0].strip()
    return generated_answer
# Example Usage
conv_template = "qwen_2" # Ensure you use the correct chat template for different models
image_path1 = "path/to/first/image.jpg"
image_path2 = "path/to/second/image.jpg"
question = "What is the difference in two images?"
result = generate_answer_from_images(image_path1, image_path2, question, model, tokenizer, image_processor, conv_template, device)
print("Generated Answer:", result)
Performance Scores
Below are the performance metrics of this model on the LEVIR-CC test set:

| Dataset | Metric | Score |
|---|---|---|
| LEVIR-CC | BLEU-1 | 85.78 |
| | BLEU-2 | 75.61 |
| | BLEU-3 | 67.52 |
| | BLEU-4 | 60.67 |
| | METEOR | 40.32 |
| | ROUGE_L | 78.28 |
| | CIDEr | 137.46 |
Note: These scores are indicative and may vary depending on evaluation protocol, preprocessing, or fine-tuning strategies.
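For reproducing such scores, change-captioning work on LEVIR-CC commonly uses the COCO caption evaluation toolkit. Below is a minimal sketch assuming the pycocoevalcap package is installed and that hypotheses and references have been collected into dicts; it illustrates the metric computation only and is not the exact evaluation script behind the numbers above.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image id -> list of reference captions (LEVIR-CC provides five per image)
# res: image id -> single-element list with the generated caption
gts = {"test_000001": ["trees are removed and an arc road is built with some buildings around ."]}
res = {"test_000001": ["the forest is replaced by a road with houses built along ."]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE_L"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):  # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s * 100:.2f}")
    else:
        print(f"{name}: {score * 100:.2f}")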
Examples
Below are sample model outputs (hypotheses) alongside the LEVIR-CC reference captions:
Hypothesis: the forest is replaced by a road with houses built along .
References:
- a winding road is built and many houses are constructed beside it to replace the former vegetation .
- trees are removed and an arc road is built with some buildings around .
- the former forest has been replaced by a residential area with roads and houses .
- the vegetation has been replaced by a road and many villas around .
- forest is taken by the houses along the road .
Hypothesis: the vegetation has been replaced by a road and many houses .
References:
- the road has been rebuilt and the plants have been replaced by many villas .
- a large number of trees and space are replaced by a residential area and road .
- trees and the road are replaced by a new road with neatly arranged residential buildings and another circular road at the top .
- a road is built and many villas are constructed on both sides of it and the vegetation is also replaced by houses .
- the woods with a road going through have turned into the residential area with roads and houses .
Citation
If you use this model in your research or applications, please cite it as follows:
@ARTICLE{9934924,
  author={Liu, Chenyang and Zhao, Rui and Chen, Hao and Zou, Zhengxia and Shi, Zhenwei},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset},
  year={2022},
  volume={60},
  number={},
  pages={1-20},
  doi={10.1109/TGRS.2022.3218921}}