Model Card for Fer14/paligemma_coffee_machine_caption

Google's PaliGemma VLM (Vision Language Model), fine-tuned to generate captions for coffee machine images.

Model Description

Usage


from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image


model_id = "Fer14/paligemma_coffee_machine_caption"

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)


image = Image.open("path to your image").convert("RGB")

# Instruction prompt that spells out the expected caption structure.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

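# Tokenize the prompt and preprocess the image into model inputs.
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

# Generate the caption; max_length caps the prompt plus the generated tokens.
output = model.generate(**inputs, max_length=1000)

# Decode, then drop the echoed prompt so that only the caption remains.
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt) :]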
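
Because the prompt asks for a fixed caption structure, the generated text can be split back into its fields. The snippet below is a minimal sketch of one way to do that, not part of the model or its training code: `CAPTION_RE` and `parse_caption` are hypothetical names, and the regular expression assumes the model follows the template closely (single-word color, shape, and button color). It accepts both the "butons" and "buttons" spellings.

```python
import re
from typing import Optional

# Hypothetical pattern mirroring the caption template from the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons"
CAPTION_RE = re.compile(
    r"^An? (?P<color>\w+) (?P<type>[^,]+), "
    r"(?P<accessories>.+), "
    r"(?P<shape>[\w-]+) shaped, "
    r"with (?P<screen>no screen|(?:a )?screen) and "
    r"(?P<number>\w+) (?P<b_color>\w+) but+ons?\.?$",
    re.IGNORECASE,
)


def parse_caption(caption: str) -> Optional[dict]:
    """Return the caption fields as a dict, or None if the text does not fit the template."""
    match = CAPTION_RE.match(caption.strip())
    return match.groupdict() if match else None


# Illustrative caption written by hand, not actual model output.
example = (
    "A black espresso coffee machine, a milk frother and a cup tray, "
    "round shaped, with screen and 3 silver buttons"
)
print(parse_caption(example))
```

In practice, `parse_caption(decoded_output)` would be called on the decoded string; a `None` result signals that the model deviated from the template and the raw caption should be used instead.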

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.2
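
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: `EXPECTED` and `check_versions` are hypothetical names, and it assumes the packages are installed under the distribution names `peft` and `transformers`. Other versions may work as well; the check only reports differences.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card.
EXPECTED = {"peft": "0.11.1", "transformers": "4.41.2"}


def check_versions(expected=EXPECTED, lookup=version):
    """Map each package to (expected, installed, matches); installed is None if missing."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = lookup(package)
        except PackageNotFoundError:
            installed = None
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, matches) in check_versions().items():
    status = "ok" if matches else "differs"
    print(f"{package}: expected {wanted}, installed {installed} ({status})")
```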