LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset

This model is a LoRA (Low-Rank Adaptation) fine-tuned version of lmms-lab/llava-onevision-qwen2-0.5b-ov, specialized for photography scene analysis and description generation. It was presented in the paper Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery and fine-tuned on the DataSeeds.AI Sample Dataset (DSD) to generate more detailed, accurate descriptions of photographic content.

Code for usage: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava

Model Description

  • Base Model: LLaVA-OneVision-Qwen2-0.5b
  • Vision Encoder: SigLIP-SO400M-patch14-384
  • Language Model: Qwen2-0.5B (~0.5B parameters)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation) with PEFT
  • Total Parameters: ~917M (513M trainable during fine-tuning, 56% of total)
  • Multimodal Projector: 1.84M parameters (100% trainable)
  • Precision: BFloat16
  • Task: Photography scene analysis and detailed image description

LoRA Configuration

  • LoRA Rank (r): 32
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • Target Modules: v_proj, k_proj, q_proj, up_proj, gate_proj, down_proj, o_proj
  • Tunable Components: mm_mlp_adapter, mm_language_model
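
For reference, this adapter setup maps onto a standard PEFT LoraConfig roughly as follows. This is a minimal sketch rather than the exact training code; the task_type is an assumption, and the actual run used the training scripts in the repository linked above.

from peft import LoraConfig

# Hypothetical reconstruction of the adapter configuration described above;
# the real fine-tuning was driven by the LLaVA-OneVision training scripts.
lora_config = LoraConfig(
    r=32,                   # LoRA rank
    lora_alpha=32,          # scaling factor (alpha / r = 1.0)
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",  # assumption: causal-LM style wrapping
)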

Training Details

Dataset

The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:

  • Compositional elements and camera perspectives
  • Lighting conditions and visual ambiance
  • Product identification and technical details
  • Photographic style and mood analysis

Training Configuration

| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Optimizer | AdamW |
| Learning Rate Schedule | Cosine decay |
| Warmup Ratio | 0.03 |
| Weight Decay | 0.01 |
| Batch Size | 2 |
| Gradient Accumulation Steps | 8 (effective batch size: 16) |
| Training Epochs | 3 |
| Max Sequence Length | 8192 |
| Max Gradient Norm | 0.5 |
| Precision | BFloat16 |
| Hardware | Single NVIDIA A100 40GB |
| Training Time | 30 hours |

Training Strategy

  • Validation Frequency: Every 50 steps for precise checkpoint selection
  • Best Checkpoint: Step 1,750 (epoch 2.9), with a validation loss of 1.83
  • Mixed Precision: BFloat16 with gradient checkpointing for memory efficiency
  • System Prompt: Consistent template requesting scene descriptions across all samples
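
The hyperparameters and validation strategy above translate fairly directly into Hugging Face TrainingArguments. The sketch below is for orientation only: the output path is hypothetical, argument names follow the standard transformers API, and the actual run used the LLaVA-OneVision training scripts from the linked repository.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    eval_steps=50,                      # validate every 50 steps
    save_steps=50,
    load_best_model_at_end=True,        # keep the lowest-validation-loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# The max sequence length (8192) is enforced by the tokenizer/data pipeline,
# not by TrainingArguments.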

Performance

Quantitative Results

The fine-tuned model improves on the base model across all evaluation metrics, with the largest relative gain in n-gram precision (BLEU-4):

| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
|---|---|---|---|---|
| BLEU-4 | 0.0199 | 0.0246 | +0.0048 | +24.09% |
| ROUGE-L | 0.2089 | 0.2140 | +0.0051 | +2.44% |
| BERTScore F1 | 0.2751 | 0.2789 | +0.0039 | +1.40% |
| CLIPScore | 0.3247 | 0.3260 | +0.0013 | +0.41% |
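
The text-overlap metrics above can be recomputed with the Hugging Face evaluate library. The snippet below is a sketch with placeholder captions; the exact tokenization and BERTScore backbone may differ from the reported evaluation (extra dependencies: evaluate, rouge_score, bert_score). CLIPScore compares each generated caption against the image itself rather than against the reference text (e.g. via torchmetrics) and is omitted here.

import evaluate

# Placeholder generated captions and reference descriptions.
predictions = ["A close-up of a silver wristwatch on a dark wooden table."]
references = ["Macro photograph of a silver wristwatch resting on dark wood."]

bleu = evaluate.load("bleu")           # max_order defaults to 4, i.e. BLEU-4
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references])["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))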

Key Improvements

  • Enhanced N-gram Precision: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
  • Better Sequential Information: ROUGE-L improvement shows enhanced capture of longer matching sequences
  • Improved Semantic Understanding: BERTScore gains demonstrate better contextual relationships
  • Maintained Visual-Semantic Alignment: CLIPScore preservation with slight improvement

Inference Performance

  • Processing Speed: 2.30 seconds per image (NVIDIA A100 40GB)
  • Memory Requirements: Optimized for single GPU inference

Usage

Installation

pip install transformers torch peft pillow

Basic Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."

# Keyword arguments avoid ambiguity in the processor's (text, images) ordering;
# depending on the processor version, the prompt may also need to follow the
# model's chat template and include its image placeholder token.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)

# Generate description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
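
If you prefer not to keep the PEFT wrapper around at inference time, the adapter can be merged into the base weights using standard PEFT functionality (the output path below is just an example):

# Merge the LoRA weights into the base model for wrapper-free inference,
# then save the merged checkpoint like any other transformers model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")  # example path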

Advanced Usage with Custom Prompts

# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")

Model Architecture

The model maintains the LLaVA-OneVision architecture with the following components:

  • Vision Encoder: SigLIP-SO400M with hierarchical feature extraction
  • Language Model: Qwen2-0.5B with 24 layers, 14 attention heads
  • Multimodal Projector: 2-layer MLP with GELU activation (mlp2x_gelu)
  • Image Processing: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
  • Context Length: 32,768 tokens with sliding window attention

Technical Specifications

  • Hidden Size: 896
  • Intermediate Size: 4,864
  • Attention Heads: 14 (2 key-value heads)
  • RMS Norm Epsilon: 1e-6
  • RoPE Theta: 1,000,000
  • Image Token Index: 151646
  • Max Image Grid: Up to 2304×2304 pixels with dynamic tiling
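
Most of these values can be read directly from the checkpoint's configuration. The sketch below assumes the config exposes standard Qwen2-style field names; the lmms-lab packaging may name some fields differently.

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    trust_remote_code=True,
)
# Field names below assume a Qwen2-style text configuration.
print(config.hidden_size)          # expected: 896
print(config.intermediate_size)    # expected: 4864
print(config.num_attention_heads)  # expected: 14
print(config.num_key_value_heads)  # expected: 2
print(config.rms_norm_eps)         # expected: 1e-06
print(config.rope_theta)           # expected: 1000000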

Training Data

The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:

  • Scene Descriptions: Detailed textual descriptions of visual content
  • Technical Metadata: Camera settings, composition details
  • Style Analysis: Photographic techniques and artistic elements
  • Quality Annotations: Professional photography standards

The dataset focuses on enhancing the model's ability to:

  • Identify specific products and technical details accurately
  • Describe lighting conditions and photographic ambiance
  • Analyze compositional elements and camera perspectives
  • Generate contextually aware scene descriptions

Limitations and Considerations

Model Limitations

  • Domain Specialization: Optimized for photography; may have reduced performance on general vision-language tasks
  • Base Model Inheritance: Inherits limitations from LLaVA-OneVision base model
  • Training Data Bias: May reflect biases present in the DataSeeds.AI dataset
  • Language Support: Primarily trained and evaluated on English descriptions

Recommended Use Cases

  • βœ… Photography scene analysis and description
  • βœ… Product photography captioning
  • βœ… Technical photography analysis
  • βœ… Visual content generation for photography applications
  • ⚠️ General-purpose vision-language tasks (may have reduced performance)
  • ❌ Non-photographic image analysis (not optimized for this use case)

Ethical Considerations

  • The model may perpetuate biases present in photography datasets
  • Generated descriptions should be reviewed for accuracy in critical applications
  • Consider potential cultural biases in photographic style interpretation

Citation

If you use this model in your research or applications, please cite:

@article{abdoli2025peerranked,
    title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery}, 
    author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
    journal={arXiv preprint arXiv:2506.05673},
    year={2025},
}

@misc{llava-onevision-dsd-finetune-2024,
  title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
  author={DataSeeds.AI},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
  note={LoRA fine-tuned model for enhanced photography description generation}
}

@article{li2024llavaonevision,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={arXiv preprint arXiv:2408.03326},
  year={2024}
}

@article{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  journal={arXiv preprint arXiv:2106.09685},
  year={2021}
}

License

This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.

Acknowledgments

  • Base Model: Thanks to LMMS Lab for the LLaVA-OneVision model
  • Vision Encoder: Thanks to Google Research for the SigLIP model
  • Dataset: GuruShots photography community for the source imagery
  • Framework: Hugging Face PEFT library for efficient fine-tuning capabilities

For questions, issues, or collaboration opportunities, please visit the model repository or contact the DataSeeds.AI team.
