BLIP Image Captioning

Model Description

BLIP_image_captioning is an image-captioning model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture. It was fine-tuned on the "image-in-words400" dataset, which pairs images with descriptive captions, and combines visual and textual features to generate contextually relevant captions for input images.

Model Details

  • Model Architecture: BLIP (Bootstrapping Language-Image Pre-training)
  • Base Model: Salesforce/blip-image-captioning-base
  • Fine-tuning Dataset: mouwiya/image-in-words400
  • Number of Parameters: approximately 247 million

Training Data

The model was fine-tuned on a shuffled subset of the "image-in-words400" dataset: 400 examples were used in total, to allow for faster iteration during development.
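
For reference, the snippet below is a minimal sketch of how such a subset can be prepared with the datasets library; the split name "train" and the fixed seed are assumptions, not the exact preprocessing used.

from datasets import load_dataset

# Load the captioning dataset from the Hub ("train" split is an assumption)
dataset = load_dataset("mouwiya/image-in-words400", split="train")

# Shuffle and keep 400 examples, mirroring the subset described above
subset = dataset.shuffle(seed=42).select(range(400))
print(subset)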

Training Procedure

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Epochs: 3
  • Evaluation Metric: BLEU Score
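
The snippet below is a minimal sketch of a fine-tuning loop that reflects these hyperparameters; the column names "image" and "caption", and the reuse of the subset from the previous snippet, are assumptions rather than the exact training script.

import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def collate_fn(batch):
    # "image" and "caption" column names are assumptions about the dataset schema
    images = [example["image"] for example in batch]
    captions = [example["caption"] for example in batch]
    return processor(images=images, text=captions, padding=True, truncation=True, return_tensors="pt")

loader = DataLoader(subset, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # BLIP returns a language-modeling loss when labels are supplied
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch + 1}: loss {outputs.loss.item():.4f}")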

Usage

To use this model for image captioning, you can load it using the Hugging Face transformers library and perform inference as shown below:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Example usage: download an image (replace URL_OF_THE_IMAGE with a real image URL)
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Preprocess the image, generate a caption, and decode it to text
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
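
Caption style and length can also be adjusted through the standard generate() arguments; the values below are illustrative, not the settings used for this model.

# Optional: beam search and a length cap for more controlled captions
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=30)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)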

Evaluation

The model was evaluated on a subset of the "image-in-words400" dataset using the BLEU score. The evaluation results are as follows:

  • Average BLEU Score: 0.35

This score indicates the model's ability to generate captions that closely match the reference descriptions in terms of overlapping n-grams.
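
For reference, the snippet below is a sketch of how a BLEU score can be computed over generated captions with the evaluate library; it is not the exact evaluation script, and the example strings are placeholders.

import evaluate

bleu = evaluate.load("bleu")

# predictions: generated captions; references: lists of reference captions per image
predictions = ["a dog running on the beach"]
references = [["a brown dog runs along the sandy beach"]]

results = bleu.compute(predictions=predictions, references=references)
print(results["bleu"])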

Limitations

  • Dataset Size: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization capabilities.
  • Domain-Specific: This model was trained on a specific dataset and may not perform as well on images from different domains.

Contact

Mouwiya S. A. Al-Qaisieh mo3awiya@gmail.com
