---
base_model: llava-hf/llava-v1.6-mistral-7b-hf
library_name: peft
license: apache-2.0
datasets:
- mirzaei2114/stackoverflowVQA-filtered-small
language:
- en
tags:
- llava
- llava-next
- fine-tuned
- stack-overflow
- qlora
- images
- vqa
- 4bit
---
# Model Card for Fine-Tuned LLaVA-Next (Stack Overflow VQA)
A fine-tuned LLaVA-Next model for visual question answering (VQA) on Stack Overflow questions that include images.
## Model Details
### Model Description
This model is a fine-tuned version of **LLaVA-Next (llava-hf/llava-v1.6-mistral-7b-hf)**, specialized for visual question answering (VQA)
on Stack Overflow questions that contain images. It was fine-tuned using **QLoRA** with 4-bit quantization and is optimized to handle both
text and image inputs.

The training data was filtered from the **mirzaei2114/stackoverflowVQA-filtered-small** dataset.
Only samples with a combined question-and-answer input length of at most 1024 were used. Images were kept at their original size
to preserve the detail needed for tasks such as optical character recognition.
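The exact preprocessing code is not published in this card; the following is a minimal sketch of the length filter described above, assuming "input length" means the token count under the base model's tokenizer and that the dataset exposes `question` and `answer` text columns (both assumptions).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Sketch only: the column names ("question", "answer") and the use of the base
# model's tokenizer for measuring length are assumptions, not the card's
# published preprocessing.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
dataset = load_dataset("mirzaei2114/stackoverflowVQA-filtered-small")

def within_limit(example, max_len=1024):
    # Keep samples whose combined question + answer fits within 1024 tokens.
    combined = example["question"] + "\n" + example["answer"]
    return len(tokenizer(combined)["input_ids"]) <= max_len

dataset = dataset.filter(within_limit)
```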
- **Developed by:** Adam Cassidy
- **Model type:** Visual QA
- **Language(s) (NLP):** EN
- **License:** Apache License, Version 2.0
- **Finetuned from model:** llava-hf/llava-v1.6-mistral-7b-hf
### Model Sources
- **Repository:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
## Uses
Take a screenshot (e.g., drag a snipping rectangle) around the exact focus/context of a software-development question (usually front-end),
and pass it to the model together with the question text for inference.
### Direct Use
Visual Question Answering (VQA) on technical Stack Overflow (software-adjacent) questions with accompanying images.
### Out-of-Scope Use
General-purpose VQA tasks: the model was fine-tuned on technical Stack Overflow content, so performance on non-technical domains may degrade.
## Bias, Risks, and Limitations
- **Model capacity:** The model was trained with 4-bit QLoRA, so only low-rank adapter weights were updated on top of a quantized base model; this may limit quality compared to full-precision fine-tuning.
- **Dataset size:** The training dataset is relatively small, which may limit generalization to other VQA datasets or to domains outside of Stack Overflow.
## How to Get Started with the Model
To use this model, ensure you have the following dependencies installed:

- `torch==2.4.1+cu121`
- `transformers==4.45.1`

Run inference following the multi-image inference example in the LLaVA-NeXT documentation: https://huggingface.co/docs/transformers/en/model_doc/llava_next#:~:text=skip_special_tokens%3DTrue))-,Multi%20image%20inference,-LLaVa%2DNext%20can
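For a single screenshot plus question, a minimal inference sketch along these lines should work (the adapter id below is a placeholder for this repository, and the question text is only an example); for several images in one prompt, follow the multi-image example linked above.

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

base_id = "llava-hf/llava-v1.6-mistral-7b-hf"
adapter_id = "<this-adapter-repo>"  # placeholder: replace with this repository's id

# Load the base model in 4-bit, matching the QLoRA setup used for fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = LlavaNextProcessor.from_pretrained(base_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# A screenshot of the code/UI the question is about, plus the question text.
image = Image.open("screenshot.png")
prompt = "[INST] <image>\nWhy does this flexbox layout overflow its container? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```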
## Training Details
### Training Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/train)
### Training Procedure
#### Training Hyperparameters
```python
TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    max_grad_norm=0.1,
    evaluation_strategy="steps",
    eval_steps=15,
    group_by_length=True,
    logging_steps=15,
    gradient_checkpointing=True,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.1,
    warmup_steps=10,
    lr_scheduler_type="cosine",
    learning_rate=1e-5,
    save_steps=15,
    save_total_limit=5,
    bf16=True,
    remove_unused_columns=False
)
```
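The LoRA adapter configuration itself is not listed in this card; the sketch below shows a typical QLoRA setup for this base model, where the 4-bit quantization settings, `r`, `lora_alpha`, `lora_dropout`, and `target_modules` are assumptions rather than the exact values used.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit quantized base model, as QLoRA requires (exact quantization settings assumed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Assumed LoRA hyperparameters; only the language-model attention projections
# are targeted in this sketch.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

This model would then be passed to a `Trainer` together with the `TrainingArguments` listed above.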
#### Speeds, Sizes, Times
Final saved checkpoint: `checkpoint-240` (i.e., 240 global training steps).
## Evaluation
- Validation loss (before fine-tuning): 2.93
- Validation loss (after fine-tuning): 1.78
### Testing Data, Factors & Metrics
#### Testing Data
[mirzaei2114/stackoverflowVQA-filtered-small](https://huggingface.co/datasets/mirzaei2114/stackoverflowVQA-filtered-small/viewer/default/test)
### Compute Infrastructure
#### Hardware
NVIDIA L4 GPU
#### Software
Google Colab
### Framework versions
- PEFT 0.13.1.dev0
- PyTorch 2.4.1+cu121
- Transformers 4.45.1