LLaVA-Next-OneVision-0.5B-levircc
CAUTION: This model only performs well on remote sensing (RS) change captioning.
The model takes two remote sensing images of the same area and describes the changes between them. On other tasks it will not perform well.
Model Overview
- Name: LLaVA-Next-OneVision-0.5B-levircc
- Model Type: Vision-Language Foundation Model
- Parameters: ~0.9B total (0.5B language model plus the vision encoder)
- Purpose: Change Captioning
- Fine-tuned On: LEVIR-CC dataset
This model was fine-tuned on the LEVIR-CC dataset, starting from the pretrained weights of the llava-next-onevision-qwen2-0.5b-ov model.
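For context, LLaVA-NeXT fine-tuning consumes training samples in a conversation-style JSON format. Below is a hypothetical sketch of how a single LEVIR-CC before/after pair could be expressed in that format; the file names, folder layout, and instruction wording are illustrative placeholders, not the exact recipe used to train this checkpoint.

import json

# Hypothetical conversion of one LEVIR-CC sample into the LLaVA conversation
# format. Paths and instruction text are placeholders, not the actual
# training configuration of this model.
sample = {
    "id": "levircc_train_000001",
    "image": ["A/train_000001.png", "B/train_000001.png"],  # before/after pair
    "conversations": [
        {"from": "human",
         "value": "<image>\n<image>\nDescribe the changes between the two images."},
        {"from": "gpt",
         "value": "a winding road is built and many houses are constructed beside it ."},
    ],
}

with open("levircc_llava_train.json", "w") as f:
    json.dump([sample], f, indent=2)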
Installation
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install --upgrade pip
pip install -e ".[train]"
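After installation, a quick import check confirms the package is visible to Python (run it once the editable install succeeds):

# Sanity check: the editable install should make the llava package importable.
import llava
from llava.model.builder import load_pretrained_model  # entry point used in the Usage section
print("LLaVA-NeXT import OK")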
Usage
Below is an example of how to use this model:
import sys
import os
# Make the llava package importable when this script runs outside the LLaVA-NeXT folder
scriptpath = "path to your LLaVA-NeXT folder"
sys.path.append(os.path.abspath(scriptpath))
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import copy
import torch
def get_device_map() -> str:
    return 'cuda' if torch.cuda.is_available() else 'cpu'
device = get_device_map()
pretrained = "turabimi4/llava-next-onevision-0.5b-levircc"
model_name = "llava_qwen"
device_map = "auto"
llava_model_args = {
    "multimodal": True,
    "attn_implementation": "sdpa",
}
overwrite_config = {}
overwrite_config["image_aspect_ratio"] = "pad"
llava_model_args["overwrite_config"] = overwrite_config
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)
model.to(device)
model.eval()
def generate_answer_from_images(image_path1, image_path2, question, model, tokenizer, image_processor, conv_template, device):
    try:
        # Load and preprocess the images
        image1 = Image.open(image_path1).convert('RGB')
        image2 = Image.open(image_path2).convert('RGB')
    except FileNotFoundError as e:
        print(f"Image not found: {e}")
        return None

    images = [image1, image2]

    # Process each image separately and collect the tensors
    image_tensors = []
    image_sizes = []
    for image in images:
        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
        # Move to the selected device instead of hard-coding CUDA
        image_tensors.append(image_tensor.half().to(device))
        image_sizes.append(image.size)

    # Add one <image> placeholder; the loop below expands it so one
    # DEFAULT_IMAGE_TOKEN lands before the question and one after it,
    # matching the two input images.
    question = f"<image> {question}"
    question_parts = question.split('<image>')
    formatted_question = ''
    for i, part in enumerate(question_parts):
        formatted_question += part
        if i < 2:  # insert DEFAULT_IMAGE_TOKEN for each of the two images
            formatted_question += DEFAULT_IMAGE_TOKEN

    # Prepare the conversation
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], formatted_question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()

    # Tokenize the prompt, mapping each image placeholder to IMAGE_TOKEN_INDEX
    input_ids = tokenizer_image_token(
        prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

    # Generate the answer (greedy decoding)
    with torch.no_grad():
        cont = model.generate(
            input_ids,
            images=image_tensors,
            image_sizes=image_sizes,
            do_sample=False,
            temperature=0,
            max_new_tokens=64
        )
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
    generated_answer = text_outputs[0].strip()
    return generated_answer
# Example Usage
conv_template = "qwen_2" # Ensure you use the correct chat template for different models
image_path1 = "path/to/first/image.jpg"
image_path2 = "path/to/second/image.jpg"
question = "What is the difference in two images?"
result = generate_answer_from_images(image_path1, image_path2, question, model, tokenizer, image_processor, conv_template, device)
print("Generated Answer:", result)
Performance Scores
Below are the performance metrics of this model on the LEVIR-CC test set:

| Dataset | Metric | Score |
|---|---|---|
| LEVIR-CC | BLEU-1 | 85.78 |
| | BLEU-2 | 75.61 |
| | BLEU-3 | 67.52 |
| | BLEU-4 | 60.67 |
| | METEOR | 40.32 |
| | ROUGE_L | 78.28 |
| | CIDEr | 137.46 |
Note: These scores are indicative and may vary depending on evaluation protocol, preprocessing, or fine-tuning strategies.
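For reproducing such scores, change-captioning work on LEVIR-CC commonly uses the COCO caption evaluation toolkit. Below is a minimal sketch assuming the pycocoevalcap package is installed and that hypotheses and references have been collected into dicts; it illustrates the metric computation only and is not the exact evaluation script behind the numbers above.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image id -> list of reference captions (LEVIR-CC provides five per image)
# res: image id -> single-element list with the generated caption
gts = {"test_000001": ["trees are removed and an arc road is built with some buildings around ."]}
res = {"test_000001": ["the forest is replaced by a road with houses built along ."]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE_L"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):  # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s * 100:.2f}")
    else:
        print(f"{name}: {score * 100:.2f}")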
Examples
Below are sample model outputs (hypotheses) alongside the LEVIR-CC reference captions:
Hypothesis: the forest is replaced by a road with houses built along .
References:
- a winding road is built and many houses are constructed beside it to replace the former vegetation .
- trees are removed and an arc road is built with some buildings around .
- the former forest has been replaced by a residential area with roads and houses .
- the vegetation has been replaced by a road and many villas around .
- forest is taken by the houses along the road .
Hypothesis: the vegetation has been replaced by a road and many houses .
References:
- the road has been rebuilt and the plants have been replaced by many villas .
- a large number of trees and space are replaced by a residential area and road .
- trees and the road are replaced by a new road with neatly arranged residential buildings and another circular road at the top .
- a road is built and many villas are constructed on both sides of it and the vegetation is also replaced by houses .
- the woods with a road going through have turned into the residential area with roads and houses .
Citation
If you use this model in your research or applications, please cite it as follows:
@ARTICLE{9934924,
  author={Liu, Chenyang and Zhao, Rui and Chen, Hao and Zou, Zhengxia and Shi, Zhenwei},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset},
  year={2022},
  volume={60},
  number={},
  pages={1-20},
  doi={10.1109/TGRS.2022.3218921}}