BLIP2Typhoon-Captioning

Model Description

BLIP2Typhoon-Captioning is an image captioning model that generates descriptive captions for images. It pairs the BLIP2 vision encoder with the Typhoon language model as the text decoder. The base models used are:

Encoder: Salesforce/blip2-opt-2.7b-coco
Decoder: scb10x/llama-3-typhoon-v1.5x-8b-instruct

The BLIP2 encoder extracts visual features from images, while the Typhoon decoder generates natural language descriptions based on these features.

Training Data

This model was trained on the COCO 2017 dataset, a widely used benchmark for image captioning. The dataset pairs a diverse set of images with multiple human-written captions per image, which exposes the model to several valid ways of describing the same scene.
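As a rough illustration of the "multiple captions per image" structure, the sketch below groups captions by image. It assumes the standard COCO captions annotation layout (an `annotations` list whose entries carry `image_id` and `caption`). The function name `group_captions_by_image` and the sample records are made up for illustration and are not taken from this model's training code.

```python
from collections import defaultdict


def group_captions_by_image(coco):
    """Map each image_id to the list of its reference captions."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"].strip())
    return dict(grouped)


# Tiny hand-written sample in the COCO captions layout (not real data)
sample = {
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs on the beach."},
        {"image_id": 1, "id": 11, "caption": "A brown dog running along the shore. "},
        {"image_id": 2, "id": 12, "caption": "Two people ride bicycles."},
    ]
}

captions = group_captions_by_image(sample)
print(len(captions[1]))  # number of captions attached to image 1
```

With a real annotation file, the same function could be applied to the dictionary returned by `json.load`.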

Training Details

Datasets: COCO 2017
Encoder: Salesforce/blip2-opt-2.7b-coco
Decoder: scb10x/llama-3-typhoon-v1.5x-8b-instruct
Training Framework: Hugging Face Transformers
Hardware: High-performance GPUs for efficient training

Usage

The BLIP2Typhoon-Captioning model can be used to generate captions for a wide variety of images. Here's how to use the model:

from PIL import Image
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Use a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model, then move the model to the device
processor = Blip2Processor.from_pretrained("MagiBoss/BLIP2Typhoon-Captioning")
model = Blip2ForConditionalGeneration.from_pretrained("MagiBoss/BLIP2Typhoon-Captioning", torch_dtype=torch.bfloat16).to(device)
model.eval()

# Prepare an image
image = Image.open("Your image...").convert("RGB")

# Generate a caption (inputs must match the model's device and dtype)
inputs = processor(images=image, return_tensors="pt", padding=True).to(device, torch.bfloat16)
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=30, pad_token_id=processor.tokenizer.pad_token_id)

# batch_decode returns one string per image; take the first one
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()

print("Generated Caption:", caption)
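
`batch_decode` returns a list with one string per input image, and the decoded text may carry stray whitespace. A small post-processing helper might look like the following. `tidy_captions` is a hypothetical name, not part of Transformers, and the rules it applies (collapse whitespace, capitalize, end with a period) are one possible convention rather than anything this model requires.

```python
def tidy_captions(decoded):
    """Normalize a list of decoded caption strings."""
    tidy = []
    for text in decoded:
        text = " ".join(text.split())  # collapse runs of whitespace, trim ends
        if not text:
            tidy.append("")
            continue
        text = text[0].upper() + text[1:]  # capitalize the first character only
        if text[-1] not in ".!?":
            text += "."
        tidy.append(text)
    return tidy


# Stand-in for the list returned by processor.batch_decode(...)
decoded = ["a man riding a wave on a surfboard\n", "  two cats   sleeping on a couch "]
print(tidy_captions(decoded))
```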

Performance

According to the author, the BLIP2Typhoon-Captioning model performs strongly on the COCO 2017 dataset, producing captions that are both accurate and descriptive. No quantitative scores are listed in this card.
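
Captioning quality on COCO is usually reported with n-gram overlap metrics such as BLEU or CIDEr, computed against the multiple reference captions of each image. The sketch below shows only the core idea behind BLEU-1 (clipped unigram precision, without the brevity penalty). It is a simplified illustration with hand-written captions, not the official COCO evaluation and not how any score for this model was computed; the function name is hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words found in the references, with each
    word's count clipped to its maximum count in any single reference."""
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], n)
    matched = sum(min(n, max_ref[word]) for word, n in cand_counts.items())
    return matched / sum(cand_counts.values())


# Hand-written example captions (not real model output)
references = ["a dog runs on the beach", "a brown dog running along the shore"]
score = clipped_unigram_precision("a dog on the beach", references)
print(score)
```

Clipping keeps a caption that repeats one correct word from scoring well: "the the the" against the reference "the cat" matches only one of its three words.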

Limitations and Future Work

While the model performs well on a wide range of images, there are limitations to its understanding and generation capabilities, especially in cases involving abstract concepts or highly specialized knowledge. Future work may include fine-tuning the model on more diverse datasets or integrating additional contextual information to enhance caption generation.

Acknowledgements

This model builds on the work of the Salesforce and Typhoon teams. The COCO dataset was instrumental in training this model.

Citation

If you use this model in your research, please cite:

@misc{BLIP2Typhoon-Captioning,
  author = {MagiBoss},
  title = {BLIP2Typhoon-Captioning},
  year = {2024},
  publisher = {Hugging Face},
  note = {https://huggingface.co/MagiBoss/BLIP2Typhoon-Captioning}
}