---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---
|
|
|
# Model Card for LLaVA-Next Fine-Tuned on Stack Overflow VQA

Fine-tuned LLaVA-Next model for visual question answering on Stack Overflow questions that include images.
|
|
|
## Model Details

### Model Description
|
|
|
This model is a fine-tuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)**, specialized for visual question answering (VQA) on Stack Overflow questions that contain images. It was fine-tuned with **QLoRA** using 4-bit quantization and handles both text and image inputs.

The training data was filtered from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset: only samples with a combined question-and-answer length of at most 1024 were used. Images were kept at their original size to preserve the detail needed for tasks such as optical character recognition.
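
For readers who want to see what that setup looks like in code, here is a minimal QLoRA sketch. The LoRA rank, alpha, dropout, and target modules below are illustrative assumptions; the card does not record the exact adapter configuration.

```python
# Minimal QLoRA setup sketch (illustrative values, not the exact training configuration).
import torch
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # cast norm layers to fp32 and enable input gradients

# Low-rank adapters on the language model's attention projections (assumed target modules).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Only the adapter weights are updated during training, which is what makes fine-tuning a 7B multimodal model feasible on a single L4 GPU.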
|
|
|
|
|
|
|
- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf
|
|
|
### Model Sources

- **Repository (base model):** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
|
|
|
## Uses

Take a screenshot (for example, with a snipping tool) cropped to the exact focus or context of a software-development question (usually front-end), and supply it together with the question text at inference time (see the inference sketch under "How to Get Started with the Model" below).
|
|
|
### Direct Use

Visual question answering (VQA) on technical, software-adjacent Stack Overflow questions with accompanying images.
|
|
|
### Out-of-Scope Use

General-purpose VQA tasks. The model was tuned on technical Stack Overflow content, so performance on non-technical domains may vary.
|
|
|
## Bias, Risks, and Limitations

- **Model capacity:** the model was trained with 4-bit QLoRA, so quantization and the low-rank adapters may limit output quality compared with full-precision fine-tuning.
- **Dataset size:** the training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside Stack Overflow.
|
|
|
## How to Get Started with the Model

To use this model, ensure the following dependencies are installed:

- torch==2.4.1+cu121
- transformers==4.45.1
- peft (see Framework versions below)

Run inference following the multi-image LLaVA-Next example in the Transformers documentation: https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can
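
For a single screenshot plus question, a minimal sketch looks like this. The adapter id and image path are placeholders, and a CUDA GPU with enough memory for 4-bit loading is assumed.

```python
# Single-image inference sketch; replace the placeholder adapter id with this repository's id.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "<this-adapter-repo>"  # placeholder for the fine-tuned LoRA adapter

processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the fine-tuned adapter weights

image = Image.open("screenshot.png")  # screenshot cropped to the relevant code/UI region
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Why does this CSS grid overflow its container?"}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```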
|
|
|
## Training Details

### Training Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)

### Training Procedure

#### Training Hyperparameters
|
|
|
    TrainingArguments(
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        max_grad_norm=0.1,
        evaluation_strategy="steps",
        eval_steps=15,
        group_by_length=True,
        logging_steps=15,
        gradient_checkpointing=True,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        weight_decay=0.1,
        warmup_steps=10,
        lr_scheduler_type="cosine",
        learning_rate=1e-5,
        save_steps=15,
        save_total_limit=5,
        bf16=True,
        remove_unused_columns=False,
    )
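
As a rough sketch of how these arguments could be wired into a `Trainer`, the snippet below assumes the 4-bit, LoRA-wrapped `model` from the QLoRA sketch in the Model Description, dataset columns named `Question`, `Answer`, and `image` (the actual schema may differ), and the dataset's test split as the evaluation set. It is an illustration, not the exact training script.

```python
# Hedged training-loop sketch; not the exact script that produced this checkpoint.
from datasets import load_dataset
from transformers import LlavaNextProcessor, Trainer, TrainingArguments

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
if processor.tokenizer.pad_token is None:
    processor.tokenizer.pad_token = processor.tokenizer.unk_token  # ensure padding is defined

ds = load_dataset("mirzaei2114/stackoverflowVQA-filtered-small")

def collate_fn(examples):
    # One chat per sample: a user turn with the image and question, an assistant turn with the answer.
    prompts, images = [], []
    for ex in examples:
        conversation = [
            {"role": "user",
             "content": [{"type": "image"}, {"type": "text", "text": ex["Question"]}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": ex["Answer"]}]},
        ]
        prompts.append(processor.apply_chat_template(conversation))
        images.append(ex["image"])
    batch = processor(images=images, text=prompts, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch

training_args = TrainingArguments(
    output_dir="llava-next-so-vqa",   # assumed output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=1e-5,
    bf16=True,
    remove_unused_columns=False,
    # ...plus the remaining arguments exactly as listed above.
)

trainer = Trainer(
    model=model,                      # the PEFT-wrapped model from the earlier sketch
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],          # assumed evaluation split
    data_collator=collate_fn,
)
trainer.train()
```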
|
|
|
#### Speeds, Sizes, Times

`checkpoint-240` (the checkpoint name corresponds to 240 optimizer steps)
|
|
|
## Evaluation

- Evaluation loss (pre-fine-tuning): 2.93
- Evaluation loss (post-fine-tuning): 1.78
|
|
|
### Testing Data, Factors & Metrics

#### Testing Data

[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)
|
|
|
### Compute Infrastructure

#### Hardware

NVIDIA L4 GPU

#### Software

Google Colab
|
|
|
### Framework versions

- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1