---
library_name: big_vision
license: gemma
pipeline_tag: image-text-to-text
extra_gated_heading: Access PaliGemma on Hugging Face
extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review
  and agree to Google’s usage license. To do this, please ensure you’re logged in
  to Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
|
# PaliGemma 2 model card |
|
|
|
> [!WARNING]
> This is a preview, strictly confidential model intended only for testing conversion and inference with transformers.
> Changes are not final; please [report any issues you find](https://huggingface.co/gv-hf/paligemma2-10b-pt-448-jax/discussions/new).
|
|
|
**Model page:** [PaliGemma](https://ai.google.dev/gemma/docs/paligemma) |
|
|
|
JAX/Flax PaliGemma 2 10B weights for use with the [`big_vision`](https://github.com/google-research/big_vision) codebase,
pre-trained with 448×448 input images and 512-token input/output text sequences.
|
|
|
The model is available in the `bfloat16` format for fine-tuning. |
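
As a rough back-of-the-envelope sketch of what `bfloat16` storage implies for the weights alone: each parameter occupies 2 bytes. The parameter count below is read off the nominal "10B" in the model name, which is an assumption for illustration; the exact count may differ, and activations and optimizer state are not included.

```python
# bfloat16 stores each parameter in 16 bits, i.e. 2 bytes.
BYTES_PER_BF16_PARAM = 2


def weights_size_gib(num_params: float,
                     bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate size of the raw weights in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# Assumption: "10B" is treated as exactly 10e9 parameters for illustration.
nominal_params = 10e9
print(f"~{weights_size_gib(nominal_params):.1f} GiB of weights in bfloat16")
```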
|
|
|
**Downloading Model Weights** |
|
|
|
First, authenticate using the Hugging Face CLI: |
|
```bash
huggingface-cli login
```
|
|
|
Use the following command to download the model weights: |
|
```bash
huggingface-cli download --local-dir models gv-hf/paligemma2-10b-pt-448-jax
```
|
This will download the weights to the `models` directory. |
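
The same download can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` Python package is installed and that you have already authenticated as described above; the helper names `download_kwargs` and `download_weights` are hypothetical and exist only for this example.

```python
def download_kwargs(repo_id: str = "gv-hf/paligemma2-10b-pt-448-jax",
                    local_dir: str = "models") -> dict:
    """Arguments mirroring the CLI call: repository ID and target directory."""
    return {"repo_id": repo_id, "local_dir": local_dir}


def download_weights(**overrides) -> str:
    """Download the repository snapshot and return the local path.

    Requires network access and prior acceptance of the license.
    """
    # Imported lazily so the module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(**{**download_kwargs(), **overrides})
```

Calling `download_weights()` should place the files in the `models` directory, like the CLI command does.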
|
|
|
**Resources and technical documentation:** |
|
|
|
* [PaliGemma 2 on Kaggle](https://www.kaggle.com/models/google/paligemma-2) |
|
* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible) |
|
|
|
**Terms of Use:** [Terms](https://ai.google.dev/gemma/terms) |
|
|
|
**Authors:** Google |
|
|
|
## Model information |
|
|
|
### Model summary |
|
|
|
PaliGemma 2 is an update of the [PaliGemma](https://arxiv.org/abs/2407.07726) |
|
vision-language model (VLM) which incorporates the capabilities of the |
|
[Gemma 2](https://arxiv.org/abs/2408.00118) models. The PaliGemma family of |
|
models is inspired by [PaLI-3](https://arxiv.org/abs/2310.09199) and based on |
|
open components such as the [SigLIP](https://arxiv.org/abs/2303.15343) vision |
|
model and [Gemma 2](https://arxiv.org/abs/2408.00118) language models. It takes |
|
both image and text as input and generates text as output, supporting multiple |
|
languages. It is designed for class-leading fine-tuning performance on a wide
range of vision-language tasks such as image and short video captioning, visual
question answering, text reading, object detection and object segmentation.
|
|
|
#### Model architecture |
|
|
|
PaliGemma 2 is the composition of a |
|
[Transformer decoder](https://arxiv.org/abs/1706.03762) and a |
|
[Vision Transformer image encoder](https://arxiv.org/abs/2010.11929). |
|
The text decoder is initialized from |
|
[Gemma 2](https://ai.google.dev/gemma/docs/base) in the 2B, 9B, and 27B |
|
parameter sizes. The image encoder is initialized from |
|
[SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb). |
|
Similar to the original PaliGemma model, PaliGemma 2 is trained following the |
|
[PaLI-3](https://arxiv.org/abs/2310.09199) recipes. |
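
To make the effect of input resolution concrete, the sketch below counts the patches a Vision Transformer with patch size 14 (the "/14" in SigLIP-So400m/14) produces for a square image. It assumes one image token per patch with no pooling, which this card does not state explicitly; `num_image_tokens` is a hypothetical helper for illustration.

```python
PATCH_SIZE = 14  # the "/14" in SigLIP-So400m/14


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches in a square image.

    Assumption: each patch becomes exactly one image token.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for res in (224, 448):
    print(f"{res}x{res} -> {num_image_tokens(res)} image tokens")
```

Under these assumptions, doubling the resolution quadruples the number of image tokens the decoder has to attend to.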
|
|
|
#### Inputs and outputs |
|
|
|
* **Input:** Image and text string, such as a prompt to caption the image, or |
|
a question. |
|
* **Output:** Generated text in response to the input, such as a caption of |
|
the image, an answer to a question, a list of object bounding box |
|
coordinates, or segmentation codewords. |
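
Bounding boxes are emitted as text, so downstream code has to decode them. The sketch below assumes a convention that this card does not spell out: each box is four `<locNNNN>` tokens in the order `y_min, x_min, y_max, x_max`, with values binned into 1024 steps relative to the image size, followed by a label, and multiple boxes separated by `;`. Treat the token format, the bin count, and the helper names as assumptions to verify against the `big_vision` code.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    y_min: float
    x_min: float
    y_max: float
    x_max: float


# Four consecutive location tokens, then an optional label up to ';' or '<'.
_BOX_RE = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]*)")
_LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_boxes(text: str, height: int, width: int, bins: int = 1024) -> List[Box]:
    """Convert location tokens in `text` to pixel-space boxes."""
    boxes = []
    for match in _BOX_RE.finditer(text):
        # Assumed order: y_min, x_min, y_max, x_max, each as a fraction of `bins`.
        y0, x0, y1, x1 = (int(v) / bins for v in _LOC_RE.findall(match.group(1)))
        boxes.append(Box(match.group(2).strip(),
                         y0 * height, x0 * width, y1 * height, x1 * width))
    return boxes


# Hypothetical model output, made up for illustration.
sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0512><loc0512><loc1023> dog"
for box in parse_boxes(sample, height=448, width=448):
    print(box)
```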
|
|
|
### Model data |
|
|
|
#### Pre-train datasets |
|
|
|
PaliGemma 2 is pre-trained on the following mixture of datasets: |
|
|
|
* **WebLI:** [WebLI (Web Language Image)](https://arxiv.org/abs/2209.06794) is |
|
a web-scale multilingual image-text dataset built from the public web. A |
|
wide range of WebLI splits are used to acquire versatile model capabilities, |
|
such as visual semantic understanding, object localization, |
|
visually-situated text understanding, and multilinguality. |
|
* **CC3M-35L:** Curated English image-alt_text pairs from webpages |
|
([Sharma et al., 2018](https://aclanthology.org/P18-1238/)). We used the |
|
[Google Cloud Translation API](https://cloud.google.com/translate) to |
|
translate into 34 additional languages. |
|
* **VQ²A-CC3M-35L/VQG-CC3M-35L:** A subset of VQ²A-CC3M
  ([Changpinyo et al., 2022a](https://aclanthology.org/2022.naacl-main.142/)),
  translated into the same additional 34 languages as CC3M-35L, using the
  [Google Cloud Translation API](https://cloud.google.com/translate).
|
* **OpenImages:** Detection and object-aware questions and answers |
|
([Piergiovanni et al. 2022](https://arxiv.org/abs/2209.04372)) generated by |
|
handcrafted rules on the [OpenImages dataset]. |
|
* **WIT:** Images and texts collected from Wikipedia |
|
([Srinivasan et al., 2021](https://arxiv.org/abs/2103.01913)). |
|
|
|
[OpenImages dataset]: https://storage.googleapis.com/openimages/web/factsfigures_v7.html |
|
PaliGemma 2 is based on Gemma 2, and you can find information on the |
|
pre-training datasets for Gemma 2 in the |
|
[Gemma 2 model card](https://ai.google.dev/gemma/docs/model_card_2). |
|
|
|
#### Data responsibility filtering |
|
|
|
The following filters are applied to WebLI, with the goal of training PaliGemma |
|
2 on safe and responsible data: |
|
|
|
* **Pornographic image filtering:** This filter removes images deemed to be of |
|
pornographic nature. |
|
* **Text safety filtering:** We identify and filter out images that are paired |
|
with unsafe text. Unsafe text is any text deemed to contain or be about |
|
child sexual abuse imagery (CSAI), pornography, vulgarities, or is otherwise |
|
offensive. |
|
* **Text toxicity filtering:** We further use the [Perspective |
|
API](https://perspectiveapi.com/) to identify and filter out images that are |
|
paired with text deemed insulting, obscene, hateful or otherwise toxic. |
|
* **Text personal information filtering:** We filtered certain personal |
|
information and other sensitive data using the [Cloud Data Loss Prevention |
|
(DLP) API](https://cloud.google.com/security/products/dlp) to protect the |
|
privacy of individuals. Identifiers such as social security numbers and |
|
[other sensitive information types] were removed. |
|
* **Additional methods:** Filtering based on content quality and safety in |
|
line with our policies and practices. |
|
|
|
[other sensitive information types]: https://cloud.google.com/sensitive-data-protection/docs/high-sensitivity-infotypes-reference
|
|
|
## Implementation information |
|
|
|
### Hardware |
|
|
|
PaliGemma 2 was trained using the latest generation of Tensor Processing Unit |
|
(TPU) hardware (TPUv5e). |
|
|
|
### Software |
|
|
|
Training was completed using [JAX](https://github.com/google/jax), |
|
[Flax](https://github.com/google/flax), |
|
[TFDS](https://github.com/tensorflow/datasets) and |
|
[`big_vision`](https://github.com/google-research/big_vision). |
|
|
|
JAX allows researchers to take advantage of the latest generation of hardware, |
|
including TPUs, for faster and more efficient training of large models. |
|
|
|
TFDS is used to access datasets and Flax is used for model architecture. The |
|
PaliGemma 2 fine-tune code and inference code are released in the `big_vision` |
|
GitHub repository. |
|
|
|
## Evaluation information |
|
|
|
### Benchmark results |
|
|
|
To verify the transferability of PaliGemma 2 to a wide variety of
academic tasks, we fine-tune the pre-trained models on each task. We report results at
different resolutions to give an impression of which tasks benefit from
increased resolution. Importantly, none of these tasks or datasets are part of
the pre-training data mixture, and their images are explicitly removed from the
web-scale pre-training data.
|
|
|
#### PaliGemma 2 results by model resolution and size |
|
|
|
| Benchmark | 224-3B | 224-10B | 224-28B | 448-3B | 448-10B | 448-28B | |
|
|-------------------------------|:------:|:-------:|:-------:|:------:|:-------:|:-------:| |
|
| [AI2D][ai2d] | 74.7 | 83.1 | 83.2 | 76.0 | 84.4 | 84.6 | |
|
| [AOKVQA-DA][aokvqa-da] (val) | 64.2 | 68.9 | 70.2 | 67.9 | 70.8 | 71.2 | |
|
| [AOKVQA-MC][aokvqa-mc] (val) | 79.7 | 83.7 | 84.7 | 82.5 | 85.9 | 87.0 | |
|
| [ActivityNet-CAP][anet-cap] | 34.2 | 35.9 | - | - | - | - | |
|
| [ActivityNet-QA][anet-qa] | 51.3 | 53.2 | - | - | - | - | |
|
| [COCO-35L][coco-35l] (avg34) | 113.9 | 115.8 | 116.5 | 115.8 | 117.2 | 117.2 | |
|
| [COCO-35L][coco-35l] (en) | 138.4 | 140.8 | 142.4 | 140.4 | 142.4 | 142.3 | |
|
| [COCOcap][coco-cap] | 141.3 | 143.7 | 144.0 | 143.4 | 145.0 | 145.2 | |
|
| [ChartQA][chartqa] (aug) | 74.4 | 74.2 | 68.9 | 89.2 | 90.1 | 85.1 | |
|
| [ChartQA][chartqa] (human) | 42.0 | 48.4 | 46.8 | 54.0 | 66.4 | 61.3 | |
|
| [CountBenchQA][countbenchqa] | 81.0 | 84.0 | 86.4 | 82.0 | 85.3 | 87.4 | |
|
| [DocVQA][docvqa] (val) | 39.9 | 43.9 | 44.9 | 73.6 | 76.6 | 76.1 | |
|
| [GQA][gqa] | 66.2 | 67.2 | 67.3 | 68.1 | 68.3 | 68.3 | |
|
| [InfoVQA][info-vqa] (val) | 25.2 | 33.6 | 36.4 | 37.5 | 47.8 | 46.7 | |
|
| [MARVL][marvl] (avg5) | 83.5 | 89.5 | 90.6 | 82.7 | 89.1 | 89.7 | |
|
| [MSRVTT-CAP][msrvtt] | 68.5 | 72.1 | - | - | - | - | |
|
| [MSRVTT-QA][msrvtt] | 50.5 | 51.9 | - | - | - | - | |
|
| [MSVD-QA][msvd-qa] | 61.1 | 62.5 | - | - | - | - | |
|
| [NLVR2][nlvr2] | 91.4 | 93.9 | 94.2 | 91.6 | 93.7 | 94.1 | |
|
| [NoCaps][nocaps] | 123.1 | 126.3 | 127.1 | 123.5 | 126.9 | 127.0 | |
|
| [OCR-VQA][ocr-vqa] | 73.4 | 74.7 | 75.3 | 75.7 | 76.3 | 76.6 | |
|
| [OKVQA][okvqa] | 64.2 | 68.0 | 71.2 | 64.1 | 68.6 | 70.6 | |
|
| [RSVQA-hr][rsvqa-hr] (test) | 92.7 | 92.6 | 92.7 | 92.8 | 92.8 | 92.8 | |
|
| [RSVQA-hr][rsvqa-hr] (test2) | 90.9 | 90.8 | 90.9 | 90.7 | 90.7 | 90.8 | |
|
| [RSVQA-lr][rsvqa-lr] | 93.0 | 92.8 | 93.5 | 92.7 | 93.1 | 93.7 | |
|
| [RefCOCO][refcoco] (testA) | 75.7 | 77.2 | 76.8 | 78.6 | 79.7 | 79.3 | |
|
| [RefCOCO][refcoco] (testB) | 71.0 | 74.2 | 73.9 | 73.5 | 76.2 | 74.8 | |
|
| [RefCOCO][refcoco] (val) | 73.4 | 75.9 | 75.0 | 76.3 | 78.2 | 77.3 | |
|
| [RefCOCO+][refcoco+] (testA) | 72.7 | 74.7 | 73.6 | 76.1 | 77.7 | 76.6 | |
|
| [RefCOCO+][refcoco+] (testB) | 64.2 | 68.4 | 67.1 | 67.0 | 71.1 | 68.6 | |
|
| [RefCOCO+][refcoco+] (val) | 68.6 | 72.0 | 70.3 | 72.1 | 74.4 | 72.8 | |
|
| [RefCOCOg][refcocog] (test) | 69.0 | 71.9 | 70.7 | 72.7 | 74.8 | 73.7 | |
|
| [RefCOCOg][refcocog] (val) | 68.3 | 71.4 | 70.5 | 72.3 | 74.4 | 73.0 | |
|
| [ST-VQA][st-vqa] (val) | 61.9 | 64.3 | 65.1 | 80.5 | 82.0 | 81.8 | |
|
| [SciCap][scicap] | 165.1 | 159.5 | 156.9 | 183.3 | 177.2 | 172.7 | |
|
| [ScienceQA][scienceqa] | 96.1 | 98.2 | 98.2 | 96.2 | 98.5 | 98.6 | |
|
| [Screen2Words][screen2words] | 113.3 | 117.8 | 122.8 | 114.0 | 119.1 | 123.4 | |
|
| [TallyQA][tallyqa] (complex) | 70.3 | 73.4 | 74.2 | 73.6 | 76.7 | 76.8 | |
|
| [TallyQA][tallyqa] (simple) | 81.8 | 83.2 | 83.4 | 85.3 | 86.2 | 85.7 | |
|
| [TextCaps][textcaps] | 127.5 | 137.9 | 139.9 | 152.1 | 157.7 | 153.6 | |
|
| [TextVQA][textvqa] (val) | 59.6 | 64.0 | 64.7 | 75.2 | 76.6 | 76.2 | |
|
| [VATEX][vatex] | 80.8 | 82.7 | - | - | - | - | |
|
| [VQAv2][vqav2] (minival) | 83.0 | 84.3 | 84.5 | 84.8 | 85.8 | 85.8 | |
|
| [VizWizVQA][vizwiz-vqa] (val) | 76.4 | 78.1 | 78.7 | 77.5 | 78.6 | 78.9 | |
|
| [WidgetCap][widgetcap] | 138.1 | 139.8 | 138.8 | 151.4 | 151.9 | 148.9 | |
|
| [XM3600][xm3600] (avg35) | 42.8 | 44.5 | 45.2 | 43.2 | 44.6 | 45.2 | |
|
| [XM3600][xm3600] (en) | 79.8 | 80.7 | 81.0 | 80.3 | 81.5 | 81.0 | |
|
| [xGQA][xgqa] (avg7) | 58.6 | 61.4 | 61.1 | 60.4 | 62.6 | 62.1 | |
|
|
|
|
|
#### Additional Benchmarks |
|
|
|
**[ICDAR 2015 Incidental][icdar2015-inc]** |
|
|
|
| Model | Precision | Recall | F1 | |
|
|-----------------|-----------|:------:|:-----:| |
|
| PaliGemma 2 3B | 81.88 | 70.73 | 75.9 | |
|
|
|
**[Total-Text][total-text]** |
|
|
|
| Model           | Precision | Recall | F1    |
|-----------------|-----------|:------:|:-----:|
| PaliGemma 2 3B  | 73.8      | 74.54  | 74.17 |
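
As a quick consistency check on the two text-detection tables above, the F1 column can be recomputed from precision and recall as their harmonic mean. This is a minimal sketch using only the numbers reported in this card; it assumes the standard F1 definition, which the card does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the tables above.
reported = {
    "ICDAR 2015 Incidental": (81.88, 70.73, 75.9),
    "Total-Text": (73.8, 74.54, 74.17),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1}")
```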
|
|
|
**[FinTabNet][fintabnet]** |
|
|
|
| Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con | |
|
|-----------------|--------|-------|-----------|-----------| |
|
| PaliGemma 2 3B | 99.18 | 98.94 | 99.43 | 99.21 | |
|
|
|
**[PubTabNet][pubtabnet]** |
|
|
|
| Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con | |
|
|-----------------|--------|-------|-----------|-----------| |
|
| PaliGemma 2 3B | 97.6 | 97.31 | 97.99 | 97.84 | |
|
|
|
**[GrandStaff][grandstaff]** |
|
|
|
| Model | CER | LER | SER | |
|
|-----------------|-----|-----|-----| |
|
| PaliGemma 2 3B | 1.6 | 6.7 | 2.3 | |
|
|
|
**[PubChem][pubchem]** |
|
|
|
* PaliGemma 2 3B, Full Match: 94.8 |
|
|
|
**[DOCCI][docci]** |
|
|
|
| Model | avg#char | avg#sent | NES % | |
|
|-----------------|----------|----------|---------| |
|
| PaliGemma 2 3B | 529 | 7.74 | 28.42 | |
|
| PaliGemma 2 10B | 521 | 7.45 | 20.27 | |
|
|
|
- *avg#char*: Average number of characters
- *avg#sent*: Average number of sentences
- *NES*: Non-entailment sentences
|
|
|
**[MIMIC-CXR][mimic-cxr]** |
|
|
|
| Model | CIDEr | BLEU4 | Rouge-L | RadGraph F1 | |
|
|-----------------|-------|-------|---------|-------------| |
|
| PaliGemma 2 3B | 19.9% | 14.6% | 31.92% | 28.8% | |
|
| PaliGemma 2 10B | 17.4% | 15% | 32.41% | 29.5% | |
|
|
|
**[Visual Spatial Reasoning][vsr]** |
|
|
|
| Model | VSR zeroshot split (test) | VSR random split (test) | |
|
|-----------------|---------------------------|--------------------------| |
|
| PaliGemma 2 3B | 0.75 | 0.82 | |
|
| PaliGemma 2 10B | 0.80 | 0.87 | |
|
|
|
## Ethics and safety |
|
|
|
### Evaluation approach |
|
|
|
Our evaluation methods include structured ethics and safety evaluations across |
|
relevant content policies, including: |
|
|
|
* Human evaluation on prompts covering child safety, content safety and |
|
representational harms. See the [Gemma model |
|
card](https://ai.google.dev/gemma/docs/model_card#evaluation_approach) for |
|
more details on evaluation approach, but with image captioning and visual |
|
question answering setups. |
|
* Image-to-Text benchmark evaluation: Benchmark against relevant academic |
|
datasets such as FairFace Dataset ([Karkkainen et al., |
|
2021](https://arxiv.org/abs/1908.04913)). |
|
|
|
### Evaluation results |
|
|
|
* The human evaluation results of ethics and safety evaluations are within |
|
acceptable thresholds for meeting [internal |
|
policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11) |
|
for categories such as child safety, content safety and representational |
|
harms. |
|
* On top of robust internal evaluations, we also use the Perspective API |
|
(threshold of 0.8) to measure toxicity, profanity, and other potential |
|
issues in the generated captions for images sourced from the FairFace |
|
dataset. We report the maximum and median values observed across subgroups |
|
for each of the perceived gender, ethnicity, and age attributes. |
|
|
|
<table> |
|
<tr> |
|
<col> |
|
<colgroup span="3"></colgroup> |
|
<colgroup span="3"></colgroup> |
|
<colgroup span="3"></colgroup> |
|
<th>Metric</th> |
|
<th colspan="3" scope="colgroup">Perceived gender</th> |
|
<th colspan="3" scope="colgroup">Ethnicity</th> |
|
<th colspan="3" scope="colgroup">Age group</th> |
|
</tr> |
|
<tr> |
|
<th>Model size</th> |
|
<th scope="col">3B</th> |
|
<th scope="col">10B</th> |
|
<th scope="col">28B</th> |
|
<th scope="col">3B</th> |
|
<th scope="col">10B</th> |
|
<th scope="col">28B</th> |
|
<th scope="col">3B</th> |
|
<th scope="col">10B</th> |
|
<th scope="col">28B</th> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th colspan="9" scope="colgroup">Maximum</th> |
|
</tr> |
|
<tr> |
|
<td>Toxicity</td> |
|
<td>0.14%</td> |
|
<td>0.15%</td> |
|
<td>0.19%</td> |
|
<td>0.29%</td> |
|
<td>0.39%</td> |
|
<td>0.39%</td> |
|
<td>0.26%</td> |
|
<td>0.18%</td> |
|
<td>0.32%</td> |
|
</tr> |
|
<tr> |
|
<td>Identity Attack</td> |
|
<td>0.04%</td> |
|
<td>0.02%</td> |
|
<td>0.02%</td> |
|
<td>0.13%</td> |
|
<td>0.06%</td> |
|
<td>0.06%</td> |
|
<td>0.06%</td> |
|
<td>0.03%</td> |
|
<td>0.06%</td> |
|
</tr> |
|
<tr> |
|
<td>Insult</td> |
|
<td>0.17%</td> |
|
<td>0.25%</td> |
|
<td>0.17%</td> |
|
<td>0.37%</td> |
|
<td>0.52%</td> |
|
<td>0.52%</td> |
|
<td>0.27%</td> |
|
<td>0.39%</td> |
|
<td>0.24%</td> |
|
</tr> |
|
<tr> |
|
<td>Threat</td> |
|
<td>0.55%</td> |
|
<td>0.43%</td> |
|
<td>0.57%</td> |
|
<td>0.83%</td> |
|
<td>0.48%</td> |
|
<td>0.48%</td> |
|
<td>0.64%</td> |
|
<td>0.43%</td> |
|
<td>0.64%</td> |
|
</tr> |
|
<tr> |
|
<td>Profanity</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th colspan="9" scope="colgroup">Median</th> |
|
</tr> |
|
<tr> |
|
<td>Toxicity</td> |
|
<td>0.13%</td> |
|
<td>0.10%</td> |
|
<td>0.18%</td> |
|
<td>0.07%</td> |
|
<td>0.07%</td> |
|
<td>0.14%</td> |
|
<td>0.12%</td> |
|
<td>0.08%</td> |
|
<td>0.12%</td> |
|
</tr> |
|
<tr> |
|
<td>Identity Attack</td> |
|
<td>0.02%</td> |
|
<td>0.01%</td> |
|
<td>0.02%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Insult</td> |
|
<td>0.15%</td> |
|
<td>0.23%</td> |
|
<td>0.14%</td> |
|
<td>0.14%</td> |
|
<td>0.17%</td> |
|
<td>0.13%</td> |
|
<td>0.09%</td> |
|
<td>0.18%</td> |
|
<td>0.16%</td> |
|
</tr> |
|
<tr> |
|
<td>Threat</td> |
|
<td>0.35%</td> |
|
<td>0.27%</td> |
|
<td>0.41%</td> |
|
<td>0.28%</td> |
|
<td>0.19%</td> |
|
<td>0.42%</td> |
|
<td>0.27%</td> |
|
<td>0.31%</td> |
|
<td>0.40%</td> |
|
</tr> |
|
<tr> |
|
<td>Profanity</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
</table> |
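
The aggregation behind a table like this can be sketched as follows: for each subgroup, compute the percentage of captions whose score meets the 0.8 threshold, then take the maximum and the median across subgroups. This is an illustrative reading of the description above, not the evaluation code that was used; the scores and subgroup names are synthetic, and no Perspective API call is made.

```python
from statistics import median
from typing import Dict, List, Tuple

THRESHOLD = 0.8  # threshold quoted in the evaluation description


def flagged_rate(scores: List[float], threshold: float = THRESHOLD) -> float:
    """Percentage of captions whose score meets or exceeds the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)


def max_and_median(scores_by_subgroup: Dict[str, List[float]]) -> Tuple[float, float]:
    """Maximum and median flagged rate across subgroups of one attribute."""
    rates = [flagged_rate(s) for s in scores_by_subgroup.values()]
    return max(rates), median(rates)


# Synthetic per-caption scores for three hypothetical subgroups.
synthetic = {
    "group_a": [0.05, 0.10, 0.90, 0.20],
    "group_b": [0.01, 0.02, 0.03, 0.04],
    "group_c": [0.85, 0.95, 0.10, 0.30],
}
worst, typical = max_and_median(synthetic)
print(f"maximum: {worst:.2f}%  median: {typical:.2f}%")
```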
|
|
|
## Usage and limitations |
|
|
|
### Intended usage |
|
|
|
Open Vision Language Models (VLMs) have a wide range of applications across |
|
various industries and domains. The following list of potential uses is not |
|
comprehensive. The purpose of this list is to provide contextual information |
|
about the possible use-cases that the model creators considered as part of model |
|
training and development. Prohibited uses of Gemma models are outlined in the |
|
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). |
|
|
|
Fine-tune on specific vision-language task: |
|
|
|
* The pre-trained models can be fine-tuned on a wide range of vision-language
  tasks such as image captioning, short video captioning, visual question
  answering, text reading, object detection and object segmentation.
* The pre-trained models can be fine-tuned for specific domains such as remote
  sensing question answering, visual questions from people who are blind,
  science question answering, and describing UI element functionalities.
|
* The pre-trained models can be fine-tuned for tasks with non-textual outputs |
|
such as bounding boxes or segmentation masks. |
|
|
|
Vision-language research: |
|
|
|
* The pre-trained models and fine-tuned models can serve as a foundation for |
|
researchers to experiment with VLM techniques, develop algorithms, and |
|
contribute to the advancement of the field. |
|
|
|
### Ethical considerations and risks |
|
|
|
The development of vision-language models (VLMs) raises several ethical |
|
concerns. In creating an open model, we have carefully considered the following: |
|
|
|
* Bias and Fairness
  * VLMs trained on large-scale, real-world image-text data can reflect
    socio-cultural biases embedded in the training material. These models
    underwent careful scrutiny; the input data pre-processing is described and
    the posterior evaluations are reported in this card.
* Misinformation and Misuse
  * VLMs can be misused to generate text that is false, misleading, or
    harmful.
  * Guidelines are provided for responsible use with the model; see the
    [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
* Transparency and Accountability
  * This model card summarizes details on the models' architecture,
    capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share
    innovation by making VLM technology accessible to developers and
    researchers across the AI ecosystem.
|
|
|
Risks identified and mitigations: |
|
|
|
* **Perpetuation of biases:** Continuous monitoring (using evaluation metrics
  and human review) and the exploration of de-biasing techniques are encouraged
  during model training, fine-tuning, and other use cases.
|
* **Generation of harmful content:** Mechanisms and guidelines for content |
|
safety are essential. Developers are encouraged to exercise caution and |
|
implement appropriate content safety safeguards based on their specific |
|
product policies and application use cases. |
|
* **Misuse for malicious purposes:** Technical limitations and developer and
  end-user education can help mitigate malicious applications of VLMs.
  Educational resources and reporting mechanisms for users to flag misuse are
  provided: see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
  Prohibited uses of Gemma models are outlined in the
  [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
|
* **Privacy violations:** Models were trained on data filtered to remove |
|
certain personal information and sensitive data. Developers are encouraged |
|
to adhere to privacy regulations with privacy-preserving techniques. |
|
|
|
### Limitations |
|
|
|
* Most limitations inherited from the underlying Gemma 2 models still apply:
  * VLMs are better at tasks that can be framed with clear prompts and
    instructions. Open-ended or highly complex tasks might be challenging.
  * Natural language is inherently complex. VLMs might struggle to grasp
    subtle nuances, sarcasm, or figurative language.
  * VLMs generate responses based on information they learned from their
    training datasets, but they are not knowledge bases. They may generate
    incorrect or outdated factual statements.
  * VLMs rely on statistical patterns in language and images. They might
    lack the ability to apply common sense reasoning in certain situations.
* PaliGemma 2 was designed first and foremost to serve as a general
  pre-trained model for fine-tuning to specialized tasks. Hence, its "out of
  the box" or "zero-shot" performance might lag behind models designed
  specifically for general purpose use.
* PaliGemma 2 is not a multi-turn chatbot. It is designed for a single round
  of image and text input.
|
|
|
|
|
[ai2d]: https://allenai.org/data/diagrams |
|
[aokvqa-da]: https://allenai.org/project/a-okvqa/home |
|
[aokvqa-mc]: https://allenai.org/project/a-okvqa/home |
|
[anet-cap]: https://paperswithcode.com/dataset/activitynet-captions |
|
[anet-qa]: https://arxiv.org/abs/1906.02467 |
|
[chartqa]: https://arxiv.org/abs/2203.10244 |
|
[coco-35l]: https://arxiv.org/pdf/2205.12522 |
|
[coco-cap]: https://cocodataset.org/#home |
|
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/ |
|
[docvqa]: https://www.docvqa.org/ |
|
[gqa]: https://cs.stanford.edu/people/dorarad/gqa/about.html |
|
[info-vqa]: https://arxiv.org/abs/2104.12756 |
|
[marvl]: https://marvl-challenge.github.io/ |
|
[msrvtt]: https://paperswithcode.com/dataset/msr-vtt |
|
[msvd-qa]: https://paperswithcode.com/dataset/msvd-qa |
|
[nlvr2]: https://lil.nlp.cornell.edu/nlvr/ |
|
[nocaps]: https://nocaps.org/ |
|
[ocr-vqa]: https://ocr-vqa.github.io/ |
|
[okvqa]: https://okvqa.allenai.org/ |
|
[refcoco]: https://arxiv.org/abs/1608.00272 |
|
[refcoco+]: https://aclanthology.org/D14-1086 |
|
[refcocog]: https://arxiv.org/abs/1511.02283 |
|
[rsvqa-hr]: https://zenodo.org/records/6344367 |
|
[rsvqa-lr]: https://zenodo.org/records/6344334 |
|
[st-vqa]: https://arxiv.org/abs/1905.13648 |
|
[scicap]: https://arxiv.org/abs/2110.11624 |
|
[scienceqa]: https://scienceqa.github.io/ |
|
[screen2words]: https://arxiv.org/abs/2108.03353 |
|
[tallyqa]: https://arxiv.org/abs/1810.12440 |
|
[textcaps]: https://textvqa.org/textcaps/ |
|
[textvqa]: https://textvqa.org/ |
|
[vatex]: https://arxiv.org/abs/1904.03493 |
|
[vizwiz-vqa]: https://vizwiz.org/tasks-and-datasets/vqa/ |
|
[widgetcap]: https://arxiv.org/abs/2010.04295 |
|
[vqav2]: https://visualqa.org/index.html |
|
[xgqa]: https://aclanthology.org/2022.findings-acl.196/ |
|
[xm3600]: https://arxiv.org/pdf/2205.12522 |
|
|
|
[icdar2015-inc]: https://arxiv.org/abs/1511.09207 |
|
[total-text]: https://paperswithcode.com/paper/total-text-a-comprehensive-dataset-for-scene |
|
[fintabnet]: https://developer.ibm.com/data/fintabnet/ |
|
[pubtabnet]: https://paperswithcode.com/dataset/pubtabnet |
|
[grandstaff]: https://link.springer.com/article/10.1007/s10032-023-00432-z |
|
[pubchem]: https://pmc.ncbi.nlm.nih.gov/articles/PMC7352161/ |
|
[docci]: https://research.google/pubs/docci-descriptions-of-connected-and-contrasting-images/ |
|
[mimic-cxr]: https://paperswithcode.com/dataset/mimic-cxr |
|
[vsr]: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00566/116470/Visual-Spatial-Reasoning |
|
|