gokaygokay posted an update Jun 18
I've fine-tuned three PaliGemma image captioner models for generating prompts for text-to-image models. They generate captions in the same style as the prompts we give to image generation models. I used the google/docci and google/imageinwords datasets for fine-tuning.
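The exact training code isn't in this post, but as a minimal sketch of the setup described above (assuming the google/paligemma-3b-pt-224 base checkpoint and "image"/"description" columns in google/docci, neither of which is stated here), fine-tuning with transformers looks roughly like this:

```python
import torch
from datasets import load_dataset
from transformers import (
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
    Trainer,
    TrainingArguments,
)

# Assumption: the base checkpoint; the post doesn't say which PaliGemma variant was used.
model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Assumption: google/docci exposes "image" and "description" columns.
train_ds = load_dataset("google/docci", split="train")

def collate(examples):
    # "caption en" is PaliGemma's standard captioning prefix; the processor
    # prepends image tokens and, via `suffix=`, builds labels with the
    # prefix portion masked out of the loss.
    prompts = ["caption en"] * len(examples)
    images = [ex["image"].convert("RGB") for ex in examples]
    targets = [ex["description"] for ex in examples]
    return processor(text=prompts, images=images, suffix=targets,
                     return_tensors="pt", padding="longest")

args = TrainingArguments(
    output_dir="paligemma-sd3-captioner",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,  # keep raw columns for the collator
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collate).train()
```

The three variants below would then differ mainly in which dataset they were trained on and how long the target captions are.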

This one gives you longer captions.

https://huggingface.co/spaces/gokaygokay/SD3-Long-Captioner

This one gives you medium-length captions.

https://huggingface.co/spaces/gokaygokay/SD3-Long-Captioner-V2

And this one gives you shorter captions.

https://huggingface.co/spaces/gokaygokay/SDXL-Captioner
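For reference, here's a minimal sketch of running one of these captioners locally with transformers. It assumes the model repo shares the Space's name (which may not be the case; check the Space's files for the real id) and uses PaliGemma's standard "caption en" prefix:

```python
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

# Assumption: repo id matches the Space name above.
model_id = "gokaygokay/SD3-Long-Captioner"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).eval()

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text="caption en", images=image, return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and decode only the newly generated caption.
prompt_len = inputs["input_ids"].shape[1]
caption = processor.decode(generated[0][prompt_len:], skip_special_tokens=True)
print(caption)  # ready to paste into an SD3/SDXL prompt box
```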

Are you able to share the code you used to fine-tune the models?


What are your thoughts on Florence-2? Do you think fine-tuning it on these datasets would help with the captioning task?


They've already fine-tuned the base model, and it looks like the fine-tuned model is better at segmentation and object detection. But its captions are shorter and less detailed. Maybe that's a good thing for avoiding hallucinations, but sometimes the fine-tuned model gives almost no details. As for your question, it does look like a fine-tunable model.
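For anyone who wants to check this base-vs-fine-tuned difference themselves, a quick sketch is to caption the same image with both checkpoints. This assumes the microsoft/Florence-2-large and microsoft/Florence-2-large-ft repos and their <MORE_DETAILED_CAPTION> task token, and relies on the model's remote code:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

image = Image.open("photo.jpg").convert("RGB")

# Compare base vs fine-tuned Florence-2 on the same image.
for repo in ("microsoft/Florence-2-large", "microsoft/Florence-2-large-ft"):
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo, torch_dtype=torch.float32, trust_remote_code=True
    ).eval()

    task = "<MORE_DETAILED_CAPTION>"  # Florence-2's longest captioning task
    inputs = processor(text=task, images=image, return_tensors="pt")
    with torch.inference_mode():
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=512,
            num_beams=3,
        )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(raw, task=task, image_size=image.size)
    print(repo, "->", parsed[task])
```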