LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
This model is a LoRA (Low-Rank Adaptation) fine-tuned version of lmms-lab/llava-onevision-qwen2-0.5b-ov, specialized for photography scene analysis and description generation. It was presented in the paper Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery, and was fine-tuned on the DataSeeds.AI Sample Dataset (DSD) to improve its ability to generate detailed, accurate descriptions of photographic content.
Code for usage: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
Model Description
- Base Model: LLaVA-OneVision-Qwen2-0.5b
- Vision Encoder: SigLIP-SO400M-patch14-384
- Language Model: Qwen2-0.5B (896M parameters)
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with PEFT
- Total Parameters: ~917M (513M trainable during fine-tuning, 56% of total)
- Multimodal Projector: 1.84M parameters (100% trainable)
- Precision: BFloat16
- Task: Photography scene analysis and detailed image description
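The parameter counts above can be sanity-checked after loading the adapter-equipped model (see the Usage section below for the loading code); a minimal sketch:

```python
# Assumes `model` is the PeftModel loaded as shown in the Usage section below.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.0f}M")  # expected on the order of ~917M plus LoRA adapter weights
```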
LoRA Configuration
- LoRA Rank (r): 32
- LoRA Alpha: 32
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj
- Tunable Components: mm_mlp_adapter, mm_language_model
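For reference, here is a minimal PEFT sketch of this configuration, assuming the base model has already been loaded as `base_model` (as in the Usage section below). The `task_type` and the exact handling of the projector and language-model components in the authors' training wrapper may differ:

```python
from peft import LoraConfig, get_peft_model

# LoRA settings matching the values listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"],
    task_type="CAUSAL_LM",  # assumption for a causal multimodal LM
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # reports the trainable parameter count
```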
Training Details
Dataset
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
- Compositional elements and camera perspectives
- Lighting conditions and visual ambiance
- Product identification and technical details
- Photographic style and mood analysis
Training Configuration
Parameter | Value |
---|---|
Learning Rate | 1e-5 |
Optimizer | AdamW |
Learning Rate Schedule | Cosine decay |
Warmup Ratio | 0.03 |
Weight Decay | 0.01 |
Batch Size | 2 |
Gradient Accumulation Steps | 8 (effective batch size: 16) |
Training Epochs | 3 |
Max Sequence Length | 8192 |
Max Gradient Norm | 0.5 |
Precision | BFloat16 |
Hardware | Single NVIDIA A100 40GB |
Training Time | 30 hours |
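As a rough illustration, these hyperparameters map onto a standard transformers TrainingArguments setup as sketched below. Argument names are the generic transformers ones and the output directory is hypothetical; the training script actually used may differ:

```python
from transformers import TrainingArguments

# Approximate mapping of the table above onto TrainingArguments; values mirror the table.
training_args = TrainingArguments(
    output_dir="llava-ov-0.5b-dsd-lora",  # illustrative output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size: 2 * 8 = 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
)
```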
Training Strategy
- Validation Frequency: Every 50 steps for precise checkpoint selection
- Best Checkpoint: Step 1,750 (epoch 2.9) with validation loss of 1.83
- Mixed Precision: BFloat16 with gradient checkpointing for memory efficiency
- System Prompt: Consistent template requesting scene descriptions across all samples
Performance
Quantitative Results
The fine-tuned model shows significant improvements across all evaluation metrics compared to the base model:
Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
---|---|---|---|---|
BLEU-4 | 0.0199 | 0.0246 | +0.0048 | +24.09% |
ROUGE-L | 0.2089 | 0.2140 | +0.0051 | +2.44% |
BERTScore F1 | 0.2751 | 0.2789 | +0.0039 | +1.40% |
CLIPScore | 0.3247 | 0.3260 | +0.0013 | +0.41% |
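The text metrics above can be recomputed with off-the-shelf tooling. Below is a hedged sketch using the Hugging Face evaluate package on toy strings; the exact preprocessing behind the reported numbers may differ, and CLIPScore (which also requires the image) is omitted here:

```python
import evaluate

# Toy example: predictions are model outputs, references are DSD ground-truth descriptions.
predictions = ["A close-up of a vintage camera on a wooden table under warm side lighting."]
references = ["A vintage film camera sits on a rustic wooden table, lit by warm window light."]

bleu = evaluate.load("bleu")        # BLEU-4 is the default (max_order=4)
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_result = rouge.compute(predictions=predictions, references=references)
bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")

print("BLEU-4:", bleu_result["bleu"])
print("ROUGE-L:", rouge_result["rougeL"])
print("BERTScore F1:", sum(bert_result["f1"]) / len(bert_result["f1"]))
```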
Key Improvements
- Enhanced N-gram Precision: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
- Better Sequential Information: ROUGE-L improvement shows enhanced capture of longer matching sequences
- Improved Semantic Understanding: BERTScore gains demonstrate better contextual relationships
- Maintained Visual-Semantic Alignment: CLIPScore preservation with slight improvement
Inference Performance
- Processing Speed: 2.30 seconds per image (NVIDIA A100 40GB)
- Memory Requirements: Optimized for single GPU inference
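Per-image latency can be verified with a simple timer. This is a sketch only; it assumes `model` and `inputs` are prepared as in the Usage section below and that a CUDA GPU is available:

```python
import time
import torch

# Rough per-image latency check; results vary with hardware, image resolution, and generation length.
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
if torch.cuda.is_available():
    torch.cuda.synchronize()
print(f"Generation latency: {time.perf_counter() - start:.2f} s per image")
```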
Usage
Installation
pip install transformers torch peft pillow accelerate
Basic Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image
# Load base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")
# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune",
)
# Load and process image
image = Image.open("your_image.jpg").convert("RGB")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# Generate description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
Advanced Usage with Custom Prompts
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment.",
]
for prompt in prompts:
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
Model Architecture
The model maintains the LLaVA-OneVision architecture with the following components:
- Vision Encoder: SigLIP-SO400M with hierarchical feature extraction
- Language Model: Qwen2-0.5B with 24 layers, 14 attention heads
- Multimodal Projector: 2-layer MLP with GELU activation (mlp2x_gelu)
- Image Processing: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
- Context Length: 32,768 tokens with sliding window attention
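For intuition, the mlp2x_gelu projector is a two-layer MLP with a GELU activation that maps vision-encoder features into the language model's hidden space. The sketch below is illustrative only; the dimensions (1152 for SigLIP-SO400M features, 896 for Qwen2-0.5B) are assumptions that happen to reproduce the ~1.84M parameter count reported above:

```python
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Sketch of an mlp2x_gelu projector: vision features -> LM hidden size."""

    def __init__(self, vision_hidden_size: int = 1152, lm_hidden_size: int = 896):
        super().__init__()
        # Two linear layers with a GELU in between; with these sizes the module has
        # roughly 1.84M parameters, matching the count reported in Model Description.
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden_size, lm_hidden_size),
            nn.GELU(),
            nn.Linear(lm_hidden_size, lm_hidden_size),
        )

    def forward(self, vision_features):
        return self.proj(vision_features)

print(sum(p.numel() for p in MultimodalProjector().parameters()))  # ~1.84M
```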
Technical Specifications
- Hidden Size: 896
- Intermediate Size: 4,864
- Attention Heads: 14 (2 key-value heads)
- RMS Norm Epsilon: 1e-6
- RoPE Theta: 1,000,000
- Image Token Index: 151646
- Max Image Grid: Up to 2304×2304 pixels with dynamic tiling
Training Data
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:
- Scene Descriptions: Detailed textual descriptions of visual content
- Technical Metadata: Camera settings, composition details
- Style Analysis: Photographic techniques and artistic elements
- Quality Annotations: Professional photography standards
The dataset focuses on enhancing the model's ability to:
- Identify specific products and technical details accurately
- Describe lighting conditions and photographic ambiance
- Analyze compositional elements and camera perspectives
- Generate contextually aware scene descriptions
Limitations and Considerations
Model Limitations
- Domain Specialization: Optimized for photography; may have reduced performance on general vision-language tasks
- Base Model Inheritance: Inherits limitations from LLaVA-OneVision base model
- Training Data Bias: May reflect biases present in the DataSeeds.AI dataset
- Language Support: Primarily trained and evaluated on English descriptions
Recommended Use Cases
- ✅ Photography scene analysis and description
- ✅ Product photography captioning
- ✅ Technical photography analysis
- ✅ Visual content generation for photography applications
- ⚠️ General-purpose vision-language tasks (may have reduced performance)
- ❌ Non-photographic image analysis (not optimized for this use case)
Ethical Considerations
- The model may perpetuate biases present in photography datasets
- Generated descriptions should be reviewed for accuracy in critical applications
- Consider potential cultural biases in photographic style interpretation
Citation
If you use this model in your research or applications, please cite:
@article{abdoli2025peerranked,
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
journal={arXiv preprint arXiv:2506.05673},
year={2025},
}
@misc{llava-onevision-dsd-finetune-2024,
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
author={DataSeeds.AI},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
note={LoRA fine-tuned model for enhanced photography description generation}
}
@article{li2024llavaonevision,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}
@article{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
journal={arXiv preprint arXiv:2106.09685},
year={2021}
}
License
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.
Acknowledgments
- Base Model: Thanks to LMMS Lab for the LLaVA-OneVision model
- Vision Encoder: Thanks to Google Research for the SigLIP model
- Dataset: GuruShots photography community for the source imagery
- Framework: Hugging Face PEFT library for efficient fine-tuning capabilities
For questions, issues, or collaboration opportunities, please visit the model repository or contact the DataSeeds.AI team.