gokaygokay posted an update Jun 18
I've fine-tuned three PaliGemma image captioner models for generating prompts for text-to-image models. They generate captions in the same style as the prompts we give to image generation models. I used the google/docci and google/imageinwords datasets for fine-tuning.
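The exact training code isn't in this post, but as a minimal sketch of the setup described above (assuming the google/paligemma-3b-pt-224 base checkpoint and "image"/"description" columns in google/docci, neither of which is stated here), fine-tuning with transformers looks roughly like this:

```python
import torch
from datasets import load_dataset
from transformers import (
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
    Trainer,
    TrainingArguments,
)

# Assumption: the base checkpoint; the post doesn't say which PaliGemma variant was used.
model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Assumption: google/docci exposes "image" and "description" columns.
train_ds = load_dataset("google/docci", split="train")

def collate(examples):
    # "caption en" is PaliGemma's standard captioning prefix; the processor
    # prepends image tokens and, via `suffix=`, builds labels with the
    # prefix portion masked out of the loss.
    prompts = ["caption en"] * len(examples)
    images = [ex["image"].convert("RGB") for ex in examples]
    targets = [ex["description"] for ex in examples]
    return processor(text=prompts, images=images, suffix=targets,
                     return_tensors="pt", padding="longest")

args = TrainingArguments(
    output_dir="paligemma-sd3-captioner",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,  # keep raw columns for the collator
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collate).train()
```

The three variants below would then differ mainly in which dataset they were trained on and how long the target captions are.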

This one gives you longer captions.

https://huggingface.co/spaces/gokaygokay/SD3-Long-Captioner

This one gives you medium-length captions.

https://huggingface.co/spaces/gokaygokay/SD3-Long-Captioner-V2

And this one gives you shorter captions.

https://huggingface.co/spaces/gokaygokay/SDXL-Captioner
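For reference, here's a minimal sketch of running one of these captioners locally with transformers. It assumes the model repo shares the Space's name (which may not be the case; check the Space's files for the real id) and uses PaliGemma's standard "caption en" prefix:

```python
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

# Assumption: repo id matches the Space name above.
model_id = "gokaygokay/SD3-Long-Captioner"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).eval()

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text="caption en", images=image, return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and decode only the newly generated caption.
prompt_len = inputs["input_ids"].shape[1]
caption = processor.decode(generated[0][prompt_len:], skip_special_tokens=True)
print(caption)  # ready to paste into an SD3/SDXL prompt box
```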

Are you able to share the code you used to fine-tune the models?


What are your thoughts on Florence-2? Do you think fine-tuning it on these datasets would help with the captioning task?


They've already fine-tuned the base model, and it looks like the fine-tuned model is better at segmentation and object detection. But its captions are shorter and less detailed. Maybe that's a good thing for avoiding hallucinations, but sometimes the fine-tuned model gives almost no details. As for your question, it does look like a fine-tunable model.
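For anyone who wants to check this base-vs-fine-tuned difference themselves, a quick sketch is to caption the same image with both checkpoints. This assumes the microsoft/Florence-2-large and microsoft/Florence-2-large-ft repos and their <MORE_DETAILED_CAPTION> task token, and relies on the model's remote code:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

image = Image.open("photo.jpg").convert("RGB")

# Compare base vs fine-tuned Florence-2 on the same image.
for repo in ("microsoft/Florence-2-large", "microsoft/Florence-2-large-ft"):
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo, torch_dtype=torch.float32, trust_remote_code=True
    ).eval()

    task = "<MORE_DETAILED_CAPTION>"  # Florence-2's longest captioning task
    inputs = processor(text=task, images=image, return_tensors="pt")
    with torch.inference_mode():
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=512,
            num_beams=3,
        )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(raw, task=task, image_size=image.size)
    print(repo, "->", parsed[task])
```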