
Model Card for Fer14/paligemma_coffee_machine_caption

Google's PaliGemma VLM (vision-language model), fine-tuned to generate captions for coffee machine images.

Model Description

Usage


from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model_id = "Fer14/paligemma_coffee_machine_caption"

# Load the fine-tuned model and its matching processor from the Hub.
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Replace the placeholder with the path to your own image.
image = Image.open("path to your image").convert("RGB")

# Prompt describing the caption structure the model should produce.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

# Tokenize the prompt and preprocess the image into model inputs.
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

# max_length counts the input tokens (image and prompt) as well as the generated ones.
output = model.generate(**inputs, max_length=1000)

# The decoded text includes the prompt, so slice it off to keep only the caption.
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt) :]
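
Because the prompt asks for a fixed caption structure, the generated caption can be split back into its fields. The following is a minimal sketch, not part of the original card: the helper name `parse_caption`, the regular expression, and the sample caption are all hypothetical, and the pattern assumes single-word colors and shapes. It accepts both the "butons" spelling used in the prompt and "buttons".

```python
import re

# Hypothetical pattern mirroring the caption structure given in the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons"
CAPTION_RE = re.compile(
    r"^An? (?P<color>\w+) (?P<type>[^,]+), (?P<accessories>.+), "
    r"(?P<shape>\w+) shaped, with (?P<screen>no screen|screen) and "
    r"(?P<number>\w+) (?P<b_color>\w+) butt?ons?\.?$",
    re.IGNORECASE,
)


def parse_caption(caption):
    """Return the caption fields as a dict, or None if the text does not fit the structure."""
    match = CAPTION_RE.match(caption.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    # Made-up caption for illustration only; it is not real model output.
    sample = (
        "A red espresso coffee machine, with a milk frother and a cup holder, "
        "round shaped, with no screen and 3 black butons"
    )
    print(parse_caption(sample))
```

In practice the input would be `decoded_output` from the usage code above. A `None` result means the model deviated from the requested structure, so the caller should handle that case.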

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.2
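
To check a local environment against the versions listed above, a small sketch using only the standard library might look like this. The helper name `compare_versions` is hypothetical, and the package names `peft` and `transformers` are assumed to be the PyPI distribution names of the two libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed on this card (distribution names are an assumption).
CARD_VERSIONS = {"peft": "0.11.1", "transformers": "4.41.2"}


def compare_versions(expected=CARD_VERSIONS):
    """Return {package: (card_version, installed_version_or_None)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            installed = None
        report[package] = (wanted, installed)
    return report


if __name__ == "__main__":
    for package, (wanted, installed) in compare_versions().items():
        status = "matches" if installed == wanted else "differs"
        print(f"{package}: card={wanted} installed={installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the check only shows where the environment differs from the one recorded here.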