Model Details

This model is a fine-tuned version of LLaVA-v1.5-7B, a vision-language model, adapted to a custom Historical Paintings Dataset. Fine-tuning used LoRA via PEFT (Parameter-Efficient Fine-Tuning) together with DeepSpeed to reduce the number of trainable parameters and use GPU resources efficiently.

Dataset

The dataset used for fine-tuning is a collection of famous historical paintings and artworks by artists such as Leonardo da Vinci and Hans von Aachen. It consists of 3k image-text pairs. A sample record is shown below. Each instance contains an image id and an image path, both of which the LLaVA data format expects.

    {
        "id": "data_0001",
        "image": "images/dataset/1.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "What is this image?"
            },
            {
                "from": "gpt",
                "value": "The Procuring Scene by Hans von Aachen is a captivating masterpiece that showcases the artists exceptional talent in depicting the nuances of human behavior and social dynamics. With remarkable attention to detail von Aachen portrays a scene of seduction and illicit liaisons subtly hinting at the undercurrents of desire and power play that permeated the elite circles of his time. Through his deft brushstrokes and skillful "
            }
        ]
    },
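
The record layout shown above can be built and sanity-checked with a short script. This is a minimal sketch, not the tooling used to create the dataset: the helper names `make_record` and `validate_record` are hypothetical, the answer text is a placeholder, and the check that turns alternate between "human" and "gpt" is an assumption drawn from the single sample.

```python
import json

# Top-level keys that appear in the sample record.
REQUIRED_KEYS = ("id", "image", "conversations")


def make_record(record_id, image_path, question, answer):
    """Build one image-text record in the same layout as the sample."""
    return {
        "id": record_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }


def validate_record(record):
    """Return a list of problems found in a record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    for i, turn in enumerate(record.get("conversations", [])):
        # Assumption: turns alternate human, gpt, human, gpt, ...
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected speaker {expected!r}")
        if not turn.get("value"):
            problems.append(f"turn {i}: empty value")
    return problems


record = make_record(
    "data_0001",
    "images/dataset/1.jpg",
    "What is this image?",
    "Placeholder description of the painting.",
)
print(json.dumps([record], indent=4))
print(validate_record(record))
```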

How to use?

Note: Do not load this model through the 'Use this model' Transformers snippet on Hugging Face. Instead, follow the step-by-step approach below to run inference.

The folder 'llava-v1.5-7b-task-lora' contains the LoRA weights, and the folder 'llava-ftmodel' contains the merged model weights and configuration files.

To use the model:

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

Place the folder 'llava-ftmodel' (from this repo) in the 'LLaVA' directory.
Make sure the installed transformers version is 4.37.2.
Place 'test.jpg' from this repo in the 'LLaVA' directory to use it as a test image.
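
The version requirement can be checked from Python before running anything. This is a minimal sketch using only the standard library; the helper names `installed_version` and `version_matches` are hypothetical and not part of LLaVA or transformers.

```python
from importlib import metadata

# Version required by this model card.
REQUIRED_TRANSFORMERS = "4.37.2"


def installed_version(package="transformers"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed, required=REQUIRED_TRANSFORMERS):
    """True only when the installed version equals the required one exactly."""
    return installed is not None and installed.strip() == required


if __name__ == "__main__":
    found = installed_version()
    if version_matches(found):
        print(f"transformers {found} is installed, as required")
    else:
        print(f"expected transformers {REQUIRED_TRANSFORMERS}, found {found}")
```

If the check fails, installing the pinned version with pip (`pip install transformers==4.37.2`) is the usual fix.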

Now run the following command:

python -m llava.serve.cli --model-path 'llava-ftmodel' --image-file 'test.jpg'

The model will prompt for human input. Type 'Describe this image' or 'What is depicted in this figure?' and press Enter. Enjoy!
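
For scripted use, the same command can be assembled in Python after confirming that the inputs are in place. This is a sketch under assumptions: the helper names `build_cli_command` and `missing_inputs` are hypothetical, and only the two flags shown in the command above are used.

```python
import shlex
from pathlib import Path


def build_cli_command(model_path="llava-ftmodel", image_file="test.jpg"):
    """Return the inference command from this card as an argument list."""
    return [
        "python", "-m", "llava.serve.cli",
        "--model-path", str(model_path),
        "--image-file", str(image_file),
    ]


def missing_inputs(model_path="llava-ftmodel", image_file="test.jpg", root="."):
    """List the required inputs that are not present under root."""
    root = Path(root)
    missing = []
    if not (root / model_path).is_dir():
        missing.append(f"model folder: {model_path}")
    if not (root / image_file).is_file():
        missing.append(f"image file: {image_file}")
    return missing


# Print the command rather than running it, so the sketch has no side effects.
print(shlex.join(build_cli_command()))
print(missing_inputs())
```

Run from inside the 'LLaVA' directory, an empty list from `missing_inputs()` means both the model folder and the test image were found; the argument list can then be passed to `subprocess.run`.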

Key model metrics

"train/global_step": 940,
"train/train_samples_per_second": 7.443,
"_step": 940,
"train/loss": 0.1388,
"train/epoch": 5,

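A few derived quantities follow from the logged values by simple arithmetic. The sketch below treats "3k" as exactly 3000 samples and assumes the throughput figure is averaged over the whole run; both are assumptions, so the results are rough estimates rather than reported facts.

```python
# Values copied from the logged metrics.
global_step = 940
epochs = 5
samples_per_second = 7.443

# Assumption: the "3k" dataset is taken as exactly 3000 samples.
dataset_size = 3000

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step / epochs

# Samples consumed per optimizer step (effective batch size estimate).
effective_batch = dataset_size / steps_per_epoch

# Estimated wall-clock training time from total samples and throughput.
total_samples = dataset_size * epochs
train_minutes = total_samples / samples_per_second / 60

print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"effective batch size (estimate): {effective_batch:.2f}")
print(f"training time (estimate): {train_minutes:.1f} minutes")
```

Under these assumptions the effective batch size comes out close to 16 and the run takes roughly half an hour. The actual batch size and runtime are not stated in this card.
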
Intended Use

The fine-tuned LLaVA model is designed for tasks related to historical paintings, such as image captioning, visual question answering, and multimodal understanding. It can be used by researchers, historians, and enthusiasts interested in exploring and analyzing historical artworks.

Fine Tuning Procedure

The model was fine-tuned on an NVIDIA A40 GPU with 48 GB of VRAM. Training used PEFT LoRA and DeepSpeed to make efficient use of GPU memory and keep the number of trainable parameters small. Once the LoRA weights were trained, they were merged into the original model weights. The final training loss was 0.1388.
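
The merge step can be illustrated with the standard LoRA formulation, in which a frozen weight matrix W is updated as W + (alpha / r) * B @ A. This is a toy sketch in pure Python, not the script used for this model: the function names are hypothetical, the matrices are tiny, and the rank and alpha used for the actual fine-tune are not stated in this card.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a base weight matrix.

    weight: d_out x d_in base matrix
    lora_b: d_out x rank matrix
    lora_a: rank x d_in matrix
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, d_out x d_in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with rank 1 (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the 'llava-ftmodel' folder can be loaded on its own.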

Performance

The fine-tuned LLaVA model has shown improved performance on tasks related to historical paintings compared to the original LLaVA-v1.5-7B model. However, no quantitative metrics or benchmarks for this comparison are provided in this model card.

Limitations and Biases

As with any language model, the fine-tuned LLaVA model may exhibit biases present in the training data, which could include historical, cultural, or societal biases. Additionally, the model's performance may be limited by the quality and diversity of the Historical Paintings Dataset used for fine-tuning.

Ethical Considerations

Users of this model should be aware of potential ethical implications, such as the use of historical artworks without proper attribution or consent. It is essential to respect intellectual property rights and ensure that any generated content or analyses are used responsibly and respectfully.