---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---
# Model Card for LLaVA-Next Fine-Tuned on Stack Overflow VQA
Fine-tuned LLaVA-Next model for visual question answering (VQA) on Stack Overflow questions with images.
## Model Details
### Model Description
This model is a fine-tuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)**, specialized for visual question answering (VQA)
on Stack Overflow questions containing images. The model was fine-tuned using **QLoRA** with 4-bit quantization and handles both
text and image inputs.
The training data was filtered from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset:
only samples whose question and answer together fit within a maximum input length of 1024 tokens were used. Images were kept at their
original resolution to preserve the detail needed for tasks such as optical character recognition.
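As a minimal sketch of that length filter, assuming the dataset exposes plain-text `question` and `answer` columns (the column names are illustrative, not confirmed here) and using the base model's tokenizer:

```python
from datasets import load_dataset
from transformers import AutoProcessor

MAX_LEN = 1024  # maximum combined token length for question + answer

processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
dataset = load_dataset("mirzaei2114/stackoverflowVQA-filtered-small")

def within_length(example):
    # Keep only samples whose question and answer together fit in MAX_LEN tokens.
    text = example["question"] + example["answer"]  # column names assumed
    return len(processor.tokenizer(text).input_ids) <= MAX_LEN

dataset = dataset.filter(within_length)
```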
- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf
### Model Sources
- **Repository:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
## Uses
Take a screenshot (for example, by dragging a snipping rectangle) around the exact focus and context of a question related to software development
(usually front end), and submit it together with the question text for inference.
### Direct Use
Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.
### Out-of-Scope Use
General-purpose VQA. The model may still answer questions in non-technical domains, but performance is likely to degrade outside technical, Stack Overflow-style content.
## Bias, Risks, and Limitations
- **Model capacity:** the model was fine-tuned with 4-bit QLoRA; quantization and low-rank adaptation may limit output quality relative to full-precision fine-tuning.
- **Dataset size:** the training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside of Stack Overflow.
## How to Get Started with the Model
To use this model, ensure you have the following dependencies installed:

```
torch==2.4.1+cu121
transformers==4.45.1
```

Run inference following the multi-image inference example in the [LLaVA-Next documentation](https://huggingface.co/docs/transformers/en/model_doc/llava_next).
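A minimal single-image inference sketch in that style, assuming the adapter weights live in this repository (the `adapter_id` below is a placeholder to replace with this repo's id):

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "path/to/this-adapter-repo"  # placeholder; use this repository's id

processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the QLoRA adapter

image = Image.open("screenshot.png")  # e.g. a snipped screenshot of the problem
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Why does this CSS rule not apply?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```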
## Training Details
### Training Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)
### Training Procedure
#### Training Hyperparameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder; the actual output directory is not recorded here
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    max_grad_norm=0.1,
    evaluation_strategy="steps",
    eval_steps=15,
    group_by_length=True,
    logging_steps=15,
    gradient_checkpointing=True,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=1e-5,
    save_steps=15,
    save_total_limit=5,
    bf16=True,
    remove_unused_columns=False,
)
```
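The exact quantization and adapter settings are not recorded in this card; the following is a hedged sketch of a typical 4-bit QLoRA setup for this base model (rank, alpha, dropout, and target modules are assumptions for illustration, not the values actually used):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit NF4 quantization, as commonly used for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter on the language model's attention projections (assumed values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```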
#### Speeds, Sizes, Times
The final saved checkpoint was `checkpoint-240` (step 240, with the settings above).
## Evaluation
- Validation loss before fine-tuning: 2.93
- Validation loss after fine-tuning: 1.78
### Testing Data, Factors & Metrics
#### Testing Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)
### Compute Infrastructure
#### Hardware
NVIDIA L4 GPU
#### Software
Google Colab
### Framework versions
- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1 |