---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---
# Model Card for LLaVA-Next Fine-Tuned on Stack Overflow VQA
Fine-tuned LLaVA-Next model for visual question answering (VQA) on Stack Overflow questions with images.
## Model Details
### Model Description
This model is a fine-tuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)**, specialized for visual question answering (VQA)
on Stack Overflow questions containing images. The model was fine-tuned using **QLoRA** with 4-bit quantization and handles both
text and image inputs.
The training data was filtered from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset:
only samples whose question and answer together fit within a maximum input length of 1024 tokens were used. Images were kept at their
original resolution to preserve the detail needed for tasks such as optical character recognition.
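As a minimal sketch of that length filter, assuming the dataset exposes plain-text `question` and `answer` columns (the column names are illustrative, not confirmed here) and using the base model's tokenizer:

```python
from datasets import load_dataset
from transformers import AutoProcessor

MAX_LEN = 1024  # maximum combined token length for question + answer

processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
dataset = load_dataset("mirzaei2114/stackoverflowVQA-filtered-small")

def within_length(example):
    # Keep only samples whose question and answer together fit in MAX_LEN tokens.
    text = example["question"] + example["answer"]  # column names assumed
    return len(processor.tokenizer(text).input_ids) <= MAX_LEN

dataset = dataset.filter(within_length)
```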
- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf
### Model Sources
- **Repository:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
## Uses
Take a screenshot (for example, by dragging a snipping rectangle) around the exact focus and context of a question related to software development
(usually front end), and submit it together with the question text for inference.
### Direct Use
Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.
### Out-of-Scope Use
General-purpose VQA. The model may still answer questions in non-technical domains, but performance is likely to degrade outside technical, Stack Overflow-style content.
## Bias, Risks, and Limitations
- **Model capacity:** the model was fine-tuned with 4-bit QLoRA; quantization and low-rank adaptation may limit output quality relative to full-precision fine-tuning.
- **Dataset size:** the training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside of Stack Overflow.
## How to Get Started with the Model
To use this model, ensure you have the following dependencies installed:

```
torch==2.4.1+cu121
transformers==4.45.1
```

Run inference following the multi-image inference example in the [LLaVA-Next documentation](https://huggingface.co/docs/transformers/en/model_doc/llava_next).
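A minimal single-image inference sketch in that style, assuming the adapter weights live in this repository (the `adapter_id` below is a placeholder to replace with this repo's id):

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "path/to/this-adapter-repo"  # placeholder; use this repository's id

processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the QLoRA adapter

image = Image.open("screenshot.png")  # e.g. a snipped screenshot of the problem
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Why does this CSS rule not apply?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```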
## Training Details
### Training Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)
### Training Procedure
#### Training Hyperparameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder; the actual output directory is not recorded here
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    max_grad_norm=0.1,
    evaluation_strategy="steps",
    eval_steps=15,
    group_by_length=True,
    logging_steps=15,
    gradient_checkpointing=True,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=1e-5,
    save_steps=15,
    save_total_limit=5,
    bf16=True,
    remove_unused_columns=False,
)
```
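The exact quantization and adapter settings are not recorded in this card; the following is a hedged sketch of a typical 4-bit QLoRA setup for this base model (rank, alpha, dropout, and target modules are assumptions for illustration, not the values actually used):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit NF4 quantization, as commonly used for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter on the language model's attention projections (assumed values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```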
#### Speeds, Sizes, Times
The final saved checkpoint was `checkpoint-240` (step 240, with the settings above).
## Evaluation
- Validation loss before fine-tuning: 2.93
- Validation loss after fine-tuning: 1.78
### Testing Data, Factors & Metrics
#### Testing Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)
### Compute Infrastructure
#### Hardware
NVIDIA L4 GPU
#### Software
Google Colab
### Framework versions
- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1 |