merve
posted an update May 14
New open Vision Language Model by @Google: PaliGemma 💙🤍

πŸ“ Comes in 3B, pretrained, mix and fine-tuned models in 224, 448 and 896 resolution
🧩 Combination of Gemma 2B LLM and SigLIP image encoder
πŸ€— Supported in transformers

PaliGemma can do:
🧩 Image segmentation and detection! 🤯
📑 Detailed document understanding and reasoning
🙋 Visual question answering, captioning, and any other VLM task!

Read our blog 🔖 hf.co/blog/paligemma
Try the demo 🪀 hf.co/spaces/google/paligemma
Check out the Spaces and the models, all in the collection 📚 google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models: google/paligemma-ft-models-6643b03efb769dad650d2dda

Nice scores in benchmarks, but it failed at my first test image: https://huggingface.co/google/paligemma-3b-mix-448/discussions/2

There might be something wrong with the demo Space configuration, or... we need better benchmarks.


@MoonRide It's not about the benchmarks; the training dataset of the mix checkpoint is just different from your use case. I responded on your issue with more details.

Hi! Nice work!
I tried this model and it is more than capable of doing what I thought it could do; it's awesome! I have some questions about a few of the details:
Is the training data mentioned in the blog all of the training data, or did PaliGemma have other training data that is not mentioned?
Is there any plan to open-source a chatty model?


@Cuiunbo I think @giffmana et al. will release a technical report in the upcoming days. For the mix and fine-tuned models, the details should be in the model cards. As for a chatty model, I don't think that's the intention of this release.