---
library_name: transformers
pipeline_tag: image-to-text
datasets:
- Mouwiya/image-in-Words400
---

# BLIP Image Captioning

## Model Description

BLIP_image_captioning is a model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture, designed for image captioning. It was fine-tuned on the "image-in-Words400" dataset, which pairs images with descriptive captions, and combines visual and textual representations to generate accurate, contextually relevant captions for images.

## Model Details

- **Model Architecture**: BLIP (Bootstrapping Language-Image Pre-training)
- **Base Model**: Salesforce/blip-image-captioning-base
- **Fine-tuning Dataset**: Mouwiya/image-in-Words400
- **Number of Parameters**: 109 million

## Training Data

The model was fine-tuned on a shuffled subset of the **image-in-Words400** dataset. A total of 400 examples were used during fine-tuning to allow for faster iteration and development.

## Training Procedure

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Metric**: BLEU score

A minimal fine-tuning sketch using these hyperparameters is included at the end of this card.

## Usage

To use this model for image captioning, load it with the Hugging Face `transformers` library and run inference as shown below:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Example usage: download an image and generate a caption
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```

## Evaluation

The model was evaluated on a subset of the "image-in-Words400" dataset using the BLEU score:

- **Average BLEU Score**: 0.35

This score reflects how closely the generated captions match the reference descriptions in terms of overlapping n-grams.

## Limitations

- **Dataset Size**: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization.
- **Domain Specificity**: The model was trained on a single dataset and may perform worse on images from other domains.

## Contact

**Mouwiya S. A. Al-Qaisieh**

mo3awiya@gmail.com
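
## Fine-tuning Sketch

The snippet below is a minimal sketch of a fine-tuning loop matching the hyperparameters listed under Training Procedure (AdamW, learning rate 2e-5, batch size 16, 3 epochs). It is not the exact training script: the dataset split and column names (`image`, `text`), the label construction, and the output directory are assumptions.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import BlipProcessor, BlipForConditionalGeneration

# Start from the base checkpoint named in Model Details
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Assumed: a "train" split with "image" (PIL) and "text" (caption) columns;
# shuffle and keep 400 examples, as described under Training Data
dataset = load_dataset("Mouwiya/image-in-Words400", split="train")
dataset = dataset.shuffle(seed=42).select(range(400))

def collate_fn(batch):
    images = [example["image"].convert("RGB") for example in batch]
    texts = [example["text"] for example in batch]
    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")
    # Use the tokenized captions as labels so the model computes the captioning loss
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # loss is returned because labels are provided
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch + 1}: last batch loss {outputs.loss.item():.4f}")

# Hypothetical output directory
model.save_pretrained("BLIP_image_captioning")
processor.save_pretrained("BLIP_image_captioning")
```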