|
--- |
|
library_name: big_vision |
|
license: gemma |
|
pipeline_tag: image-text-to-text |
|
extra_gated_heading: Access PaliGemma on Hugging Face |
|
extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review |
|
and agree to Google’s usage license. To do this, please ensure you’re logged-in |
|
to Hugging Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
# PaliGemma 2 model card |
|
|
|
> [!WARNING] |
|
> This model is a strictly confidential preview, intended only to test conversion and inference using transformers.

> Changes are not final; please [report any issues you find](https://huggingface.co/gv-hf/paligemma2-ft-docci-10b-448-jax/discussions/new).
|
|
|
**Model page:** [PaliGemma](https://ai.google.dev/gemma/docs/paligemma) |
|
|
|
JAX/FLAX PaliGemma 2 10B weights for use with the [`big_vision`](https://github.com/google-research/big_vision) codebase,

fine-tuned with 448x448 input images on the <a href="https://google.github.io/docci/">DOCCI</a> dataset.
|
|
|
The model is available in the `bfloat16` format for research purposes only. |
|
|
|
The fine-tuning config is available in the <a href="https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/paligemma/transfers">big_vision</a> repository.
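
The snippet below is a minimal inference sketch using the Hugging Face `transformers` API (the workflow this preview is meant to exercise). It assumes a converted, `transformers`-format checkpoint; the repository id and the image path are placeholders, not part of this release.

```python
# Minimal inference sketch via transformers; the checkpoint id is a
# placeholder for a converted transformers-format version of these weights.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-10b-ft-docci-448"  # placeholder id
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
prompt = "caption en"              # PaliGemma task-prefix convention

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```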
|
|
|
**Resources and technical documentation:** |
|
* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible) |
|
* [PaliGemma 2 on Kaggle](https://www.kaggle.com/models/google/paligemma-2) |
|
|
|
**Terms of Use:** [Terms](https://ai.google.dev/gemma/terms) |
|
|
|
**Authors:** Google |
|
|
|
## Model information |
|
|
|
### Model summary |
|
|
|
PaliGemma 2 is an update of the [PaliGemma](https://arxiv.org/abs/2407.07726) |
|
vision-language model (VLM) which incorporates the capabilities of the |
|
[Gemma 2](https://arxiv.org/abs/2408.00118) models. The PaliGemma family of |
|
models is inspired by [PaLI-3](https://arxiv.org/abs/2310.09199) and based on |
|
open components such as the [SigLIP](https://arxiv.org/abs/2303.15343) vision |
|
model and [Gemma 2](https://arxiv.org/abs/2408.00118) language models. It takes |
|
both image and text as input and generates text as output, supporting multiple |
|
languages. It is designed for class-leading fine-tuning performance on a wide

range of vision-language tasks such as image and short video captioning, visual

question answering, text reading, object detection, and object segmentation.
|
|
|
#### Model architecture |
|
|
|
PaliGemma 2 is the composition of a |
|
[Transformer decoder](https://arxiv.org/abs/1706.03762) and a |
|
[Vision Transformer image encoder](https://arxiv.org/abs/2010.11929). |
|
The text decoder is initialized from |
|
[Gemma-2](https://ai.google.dev/gemma/docs/base) in the 2B, 9B, and 27B |
|
parameter sizes. The image encoder is initialized from |
|
[SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb). |
|
Similar to the original PaliGemma model, PaliGemma 2 is trained following the |
|
[PaLI-3](https://arxiv.org/abs/2310.09199) recipes. |
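
As a schematic illustration of this composition (a sketch, not the `big_vision` implementation), the SigLIP image features are linearly projected into the Gemma 2 embedding space and prepended to the text embeddings as a prefix the decoder conditions on; the names and shapes below are illustrative.

```python
# Schematic only: combining image and text inputs in a PaliGemma-style model.
import jax.numpy as jnp

def build_prefix_sequence(image_features, text_embeddings, projection):
    """image_features: [num_patches, vit_dim]
    text_embeddings: [text_len, gemma_dim]
    projection:      [vit_dim, gemma_dim]  (learned linear connector)
    """
    image_embeddings = image_features @ projection  # [num_patches, gemma_dim]
    # The image tokens act as a prefix; the Gemma 2 decoder then generates
    # text autoregressively conditioned on the combined sequence.
    return jnp.concatenate([image_embeddings, text_embeddings], axis=0)
```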
|
|
|
#### Inputs and outputs |
|
|
|
* **Input:** Image and text string, such as a prompt to caption the image, or |
|
a question. |
|
* **Output:** Generated text in response to the input, such as a caption of |
|
the image, an answer to a question, a list of object bounding box |
|
coordinates, or segmentation codewords. |
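
Detection and segmentation results are emitted as special tokens inside the generated text. As a hedged illustration of the PaliGemma output convention, bounding boxes appear as four `<locXXXX>` tokens (coordinates binned into 1024 buckets in `y_min, x_min, y_max, x_max` order, normalized to the input image). The parser below is an illustrative sketch, not shipped code, and the segmentation `<segXXX>` codewords (which require the mask decoder from `big_vision`) are not handled.

```python
# Illustrative parser for PaliGemma-style detection output, e.g.
# "<loc0256><loc0128><loc0768><loc0896> cat ; <loc...> dog".
import re

LOC_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)

def parse_detections(text: str, width: int, height: int):
    """Returns labeled boxes in (x_min, y_min, x_max, y_max) pixel coordinates."""
    detections = []
    for y_min, x_min, y_max, x_max, label in LOC_PATTERN.findall(text):
        detections.append({
            "label": label.strip(),
            "box_xyxy": (
                int(x_min) / 1024 * width,
                int(y_min) / 1024 * height,
                int(x_max) / 1024 * width,
                int(y_max) / 1024 * height,
            ),
        })
    return detections
```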
|
### Model data |
|
|
|
#### Pre-train datasets |
|
|
|
PaliGemma 2 is pre-trained on the following mixture of datasets: |
|
|
|
* **WebLI:** [WebLI (Web Language Image)](https://arxiv.org/abs/2209.06794) is |
|
a web-scale multilingual image-text dataset built from the public web. A |
|
wide range of WebLI splits are used to acquire versatile model capabilities, |
|
such as visual semantic understanding, object localization, |
|
visually-situated text understanding, and multilinguality. |
|
* **CC3M-35L:** Curated English image-alt_text pairs from webpages |
|
([Sharma et al., 2018](https://aclanthology.org/P18-1238/)). We used the |
|
[Google Cloud Translation API](https://cloud.google.com/translate) to |
|
translate into 34 additional languages. |
|
* **VQ²A-CC3M-35L/VQG-CC3M-35L:** A subset of VQ2A-CC3M |
|
([Changpinyo et al., 2022a](https://aclanthology.org/2022.naacl-main.142/)), |
|
translated into the same additional 34 languages as CC3M-35L, using the |
|
[Google Cloud Translation API](https://cloud.google.com/translate). |
|
* **OpenImages:** Detection and object-aware questions and answers |
|
([Piergiovanni et al. 2022](https://arxiv.org/abs/2209.04372)) generated by |
|
handcrafted rules on the [OpenImages dataset]. |
|
* **WIT:** Images and texts collected from Wikipedia |
|
([Srinivasan et al., 2021](https://arxiv.org/abs/2103.01913)). |
|
[OpenImages dataset]: https://storage.googleapis.com/openimages/web/factsfigures_v7.html |
|
|
|
#### Data responsibility filtering |
|
|
|
The following filters are applied to WebLI, with the goal of training PaliGemma |
|
2 on clean data: |
|
|
|
* **Pornographic image filtering:** This filter removes images deemed to be of |
|
pornographic nature. |
|
* **Text safety filtering:** We identify and filter out images that are paired

with unsafe text. Unsafe text is any text deemed to contain or be about child

sexual abuse imagery (CSAI), pornography, or vulgarities, or to be otherwise

offensive.
|
* **Text toxicity filtering:** We further use the [Perspective

API](https://perspectiveapi.com/) to identify and filter out images that are

paired with text deemed insulting, obscene, hateful, or otherwise toxic (a

minimal sketch of this kind of check follows this list).
|
* **Text personal information filtering:** We filtered certain personal |
|
information and other sensitive data using the [Cloud Data Loss Prevention |
|
(DLP) API](https://cloud.google.com/security/products/dlp) to protect the |
|
privacy of individuals. Identifiers such as social security numbers and |
|
[other sensitive information types] were removed. |
|
* **Additional methods:** Filtering based on content quality and safety in |
|
line with our policies and practices. |
|
[other sensitive information types]: https://cloud.google.com/sensitive-data-protection/docs/high-sensitivity-infotypes-reference?_gl=1*jg604m*_ga*ODk5MzA3ODQyLjE3MTAzMzQ3NTk.*_ga_WH2QY8WWF5*MTcxMDUxNTkxMS4yLjEuMTcxMDUxNjA2NC4wLjAuMA..&_ga=2.172110058.-899307842.1710334759 |
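
The sketch below illustrates the kind of Perspective API toxicity check referenced in the text toxicity filtering item above. It is not the actual WebLI filtering pipeline (whose attributes, thresholds, and infrastructure are not public); the 0.8 threshold simply mirrors the value mentioned in the ethics evaluation section, and the API key is a placeholder.

```python
# Illustrative Perspective API check; not the production filtering pipeline.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def is_toxic(text: str, threshold: float = 0.8) -> bool:
    """True if Perspective's TOXICITY score meets or exceeds the threshold."""
    request = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return score >= threshold

# Example: keep only image-text pairs whose paired text passes the check.
# clean_pairs = [(img, txt) for img, txt in pairs if not is_toxic(txt)]
```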
|
|
|
## Implementation information |
|
|
|
### Hardware |
|
|
|
PaliGemma 2 was trained using the latest generation of Tensor Processing Unit |
|
(TPU) hardware (TPUv5e). |
|
|
|
### Software |
|
|
|
Training was done using [JAX](https://github.com/google/jax), |
|
[Flax](https://github.com/google/flax), |
|
[TFDS](https://github.com/tensorflow/datasets) and |
|
[`big_vision`](https://github.com/google-research/big_vision). |
|
|
|
JAX allows researchers to take advantage of the latest generation of hardware, |
|
including TPUs, for faster and more efficient training of large models. |
|
|
|
TFDS is used to access datasets and Flax is used for model architecture. The |
|
PaliGemma 2 fine-tuning code and inference code are released in the `big_vision`
|
GitHub repository. |
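
For illustration, data access through TFDS looks like the sketch below; `coco_captions` is a public TFDS dataset used here only as a stand-in, not the PaliGemma 2 training mixture.

```python
# Illustrative TFDS usage, as employed in big_vision-style input pipelines.
import tensorflow_datasets as tfds

ds = tfds.load("coco_captions", split="train", shuffle_files=True)
for example in ds.take(1):
    image = example["image"]                # uint8 tensor, [H, W, 3]
    captions = example["captions"]["text"]  # per-image caption strings
```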
|
|
|
## Evaluation information |
|
|
|
### Benchmark results |
|
|
|
In order to verify the transferability of PaliGemma 2 to a wide variety of |
|
academic tasks, we fine-tune the pretrained models on each task. Additionally, we
|
train the mix model with a mixture of the transfer tasks. We report results on |
|
different resolutions to provide an impression of which tasks benefit from |
|
increased resolution. Importantly, none of these tasks or datasets are part of |
|
the pretraining data mixture, and their images are explicitly removed from the |
|
web-scale pre-training data. |
|
|
|
#### PaliGemma 2 results by model resolution and size |
|
|
|
| Benchmark | 224-3B | 224-10B | 224-28B | 448-3B | 448-10B | 448-28B | |
|
|-------------------------------|--------|:-------:|:-------:|:------:|:-------:|:-------:| |
|
| [AI2D][ai2d] | 74.4 | 82.8 | 83.1 | 75.6 | 84.6 | 84.5 | |
|
| [AOKVQA-DA][aokvqa-da] (val) | 64.2 | 69.0 | 69.9 | 68.2 | 70.9 | 71.2 | |
|
| [AOKVQA-MC][aokvqa-mc] (val) | 78.9 | 83.1 | 84.9 | 82.4 | 86.1 | 87.1 | |
|
| [ActivityNet-CAP][anet-cap] | 34.5 | 35.8 | - | - | - | - | |
|
| [ActivityNet-QA][anet-qa] | 51.3 | 53.6 | - | - | - | - | |
|
| [COCO-35L][coco-35l] (avg34) | 113.8 | 115.8 | 116.5 | 115.8 | 117.3 | 117.2 | |
|
| [COCO-35L][coco-35l] (en) | 138.2 | 140.7 | 142.9 | 141.0 | 142.0 | 141.7 | |
|
| [COCOcap][coco-cap] | 141.4 | 143.7 | 144.0 | 143.1 | 144.8 | 145.4 | |
|
| [ChartQA][chartqa] (avg) | 57.7 | 60.8 | 58.0 | 71.6 | 78.4 | 73.0 | |
|
| [CountBenchQA][countbenchqa] | 81.0 | 85.7 | 86.3 | 80.6 | 87.1 | 87.3 | |
|
| [DocVQA][docvqa] (val) | 39.6 | 43.0 | 45.0 | 74.1 | 76.8 | 75.7 | |
|
| [GQA][gqa] | 66.4 | 67.4 | 67.3 | 68.2 | 68.6 | 68.2 | |
|
| [InfoVQA][info-vqa] (val) | 25.5 | 33.8 | 36.3 | 37.8 | 47.8 | 46.8 | |
|
| [MARVL][marvl] (avg5) | 83.4 | 89.4 | 90.8 | 82.5 | 89.1 | 89.8 | |
|
| [MSRVTT-CAP][msrvtt] | 67.5 | 71.4 | - | - | - | - | |
|
| [MSRVTT-QA][msrvtt] | 50.4 | 51.8 | - | - | - | - | |
|
| [MSVD-QA][msvd-qa] | 61.0 | 62.3 | - | - | - | - | |
|
| [NLVR2][nlvr2] | 91.5 | 93.7 | 94.1 | 91.4 | 94.0 | 94.0 | |
|
| [NoCaps][nocaps] | 123.4 | 126.5 | 126.9 | 123.6 | 126.9 | 127.1 | |
|
| [OCR-VQA][ocr-vqa] | 73.4 | 74.6 | 75.1 | 75.5 | 76.1 | 76.4 | |
|
| [OKVQA][okvqa] | 64.0 | 68.0 | 71.3 | 64.5 | 68.2 | 70.8 | |
|
| [RSVQA-hr][rsvqa-hr] (test) | 92.7 | 92.6 | 92.7 | 92.8 | 92.8 | 92.9 | |
|
| [RSVQA-hr][rsvqa-hr] (test2) | 90.8 | 90.9 | 90.7 | 90.7 | 90.8 | 90.9 | |
|
| [RSVQA-lr][rsvqa-lr] | 93.3 | 93.7 | 93.6 | 92.3 | 93.2 | 93.3 | |
|
| [RefCOCO][refcoco] (testA) | 75.9 | 77.1 | 76.8 | 78.7 | 79.7 | 79.2 | |
|
| [RefCOCO][refcoco] (testB) | 70.8 | 74.1 | 74.0 | 73.6 | 76.1 | 74.9 | |
|
| [RefCOCO][refcoco] (val) | 73.6 | 75.8 | 75.0 | 76.2 | 78.3 | 77.2 | |
|
| [RefCOCO+][refcoco+] (testA) | 72.9 | 74.9 | 73.5 | 76.4 | 77.6 | 76.8 | |
|
| [RefCOCO+][refcoco+] (testB) | 64.3 | 68.1 | 67.2 | 66.7 | 71.3 | 68.6 | |
|
| [RefCOCO+][refcoco+] (val) | 68.6 | 71.9 | 70.1 | 71.9 | 74.5 | 72.8 | |
|
| [RefCOCOg][refcocog] (test) | 68.8 | 72.1 | 70.7 | 72.6 | 74.8 | 73.9 | |
|
| [RefCOCOg][refcocog] (val) | 68.0 | 71.3 | 70.7 | 72.3 | 74.3 | 73.1 | |
|
| [ST-VQA][st-vqa] (val) | 62.0 | 64.2 | 65.6 | 80.5 | 81.6 | 81.9 | |
|
| [SciCap][scicap] | 165.7 | 158.8 | - | 183.8 | 177.0 | - | |
|
| [ScienceQA][scienceqa] | 95.7 | 98.4 | 98.1 | 96.4 | 98.2 | 98.7 | |
|
| [Screen2Words][screen2words] | 114.1 | 118.1 | 122.9 | 113.6 | 119.4 | 123.0 | |
|
| [TallyQA][tallyqa] (complex) | 70.0 | 73.5 | 74.1 | 73.8 | 77.1 | 77.1 | |
|
| [TallyQA][tallyqa] (simple) | 81.8 | 83.2 | 83.4 | 85.4 | 86.4 | 85.8 | |
|
| [TextCaps][textcaps] | 127.3 | 138.2 | 140.0 | 152.2 | 157.3 | 153.2 | |
|
| [TextVQA][textvqa] (val) | 59.6 | 64.0 | 64.7 | 75.2 | 76.7 | 76.2 | |
|
| [VATEX][vatex] | 80.3 | 82.9 | - | - | - | - | |
|
| [VQAv2][vqav2] (minival) | 82.7 | 84.0 | 84.5 | 84.6 | 85.8 | 85.8 | |
|
| [VizWizVQA][vizwiz-vqa] (val) | 76.2 | 77.6 | 78.6 | 77.3 | 79.0 | 78.5 | |
|
| [XM3600][xm3600] (avg35) | 42.6 | 44.5 | 45.0 | 43.3 | 44.5 | 45.2 | |
|
| [XM3600][xm3600] (en) | 80.4 | 81.1 | 81.1 | 79.6 | 81.5 | 80.8 | |
|
| [xGQA][xgqa] (avg7) | 58.8 | 61.5 | 61.0 | 60.3 | 62.8 | 61.9 | |
|
|
|
Note: Values in the table above may differ slightly from the PaliGemma 2 |
|
technical report, because these results were obtained for a single random seed. |
|
|
|
#### Additional benchmarks
|
|
|
**[ICDAR 2015 Incidental][icdar2015-inc]** |
|
|
|
| Model | Precision | Recall | F1 | |
|
|-----------------|-----------|:------:|:-----:| |
|
| PaliGemma 2 3B | 81.88 | 70.73 | 75.9 | |
|
|
|
**[Total-Text][total-text]** |
|
|
|
| Model | Precision | Recall | F1 | |
|
|-----------------|-----------|:------:|:-----:| |
|
| PaliGemma 2 3B  | 73.8      | 74.54  | 74.17 |
|
|
|
**[FinTabNet][fintabnet]** |
|
|
|
| Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con | |
|
|-----------------|--------|-------|-----------|-----------| |
|
| PaliGemma 2 3B | 99.18 | 98.94 | 99.43 | 99.21 | |
|
|
|
**[PubTabNet][pubtabnet]** |
|
|
|
| Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con | |
|
|-----------------|--------|-------|-----------|-----------| |
|
| PaliGemma 2 3B | 97.6 | 97.31 | 97.99 | 97.84 | |
|
|
|
**[GrandStaff][grandstaff]** |
|
|
|
| Model | CER | LER | SER | |
|
|-----------------|-----|-----|-----| |
|
| PaliGemma 2 3B | 1.6 | 6.7 | 2.3 | |
|
|
|
**[PubChem][pubchem]** |
|
|
|
* PaliGemma 2 3B, Full Match: 94.8 |
|
|
|
**[DOCCI][docci]** |
|
|
|
| Model | avg#char | avg#sent | NES % | |
|
|-----------------|----------|----------|---------| |
|
| PaliGemma 2 3B | 529 | 7.74 | 28.42 | |
|
| PaliGemma 2 10B | 521 | 7.45 | 20.27 | |
|
|
|
- *avg#char*: Average number of characters |
|
- *avg#sent*: Average number of sentences
|
- *NES*: Non-entailment sentences (lower is better)
|
|
|
**[MIMIC-CXR][mimic-cxr]** |
|
|
|
| Model | CIDEr | BLEU4 | Rouge-L | RadGraph F1 | |
|
|-----------------|-------|-------|---------|-------------| |
|
| PaliGemma 2 3B | 19.9% | 14.6% | 31.92% | 28.8% | |
|
| PaliGemma 2 10B | 17.4% | 15% | 32.41% | 29.5% | |
|
|
|
**[Visual Spatial Reasoning][vsr]** |
|
|
|
| Model | VSR zeroshot split (test) | VSR random split (test) | |
|
|-----------------|---------------------------|--------------------------| |
|
| PaliGemma 2 3B | 0.75 | 0.82 | |
|
| PaliGemma 2 10B | 0.80 | 0.87 | |
|
|
|
## Ethics and safety |
|
|
|
### Evaluation approach |
|
|
|
Our evaluation methods include structured evaluations and internal red-teaming |
|
testing of relevant content policies. Red-teaming was conducted by a number of |
|
different teams, each with different goals and human evaluation metrics. These |
|
models were evaluated against a number of different categories relevant to |
|
ethics and safety, including: |
|
|
|
* Human evaluation on prompts covering child safety, content safety, and

representational harms. See the [Gemma model

card](https://ai.google.dev/gemma/docs/model_card#evaluation_approach) for

more details on the evaluation approach, here applied to image captioning and

visual question answering setups.
|
* Image-to-text benchmark evaluation: benchmarking against relevant academic

datasets such as the FairFace dataset ([Karkkainen et al.,

2021](https://arxiv.org/abs/1908.04913)).
|
### Evaluation results |
|
|
|
* The human evaluation results of ethics and safety evaluations are within |
|
acceptable thresholds for meeting [internal |
|
policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11) |
|
for categories such as child safety, content safety and representational |
|
harms. |
|
* On top of robust internal evaluations, we also use the Perspective API |
|
(threshold of 0.8) to measure toxicity, profanity, and other potential |
|
issues in the generated captions for images sourced from the FairFace |
|
dataset. We report the maximum and median values observed across subgroups |
|
for each of the perceived gender, ethnicity, and age attributes. |
|
<table>

  <col>

  <colgroup span="3"></colgroup>

  <colgroup span="3"></colgroup>

  <colgroup span="3"></colgroup>

  <tr>

    <th>Metric</th>

    <th colspan="3" scope="colgroup">Perceived gender</th>

    <th colspan="3" scope="colgroup">Ethnicity</th>

    <th colspan="3" scope="colgroup">Age group</th>

  </tr>
|
<tr> |
|
<th>Model size</th> |
|
<th scope="col">2B</th> |
|
<th scope="col">9B</th> |
|
<th scope="col">27B</th> |
|
<th scope="col">2B</th> |
|
<th scope="col">9B</th> |
|
<th scope="col">27B</th> |
|
<th scope="col">2B</th> |
|
<th scope="col">9B</th> |
|
<th scope="col">27B</th> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th colspan="9" scope="colgroup">Maximum</th> |
|
</tr> |
|
<tr> |
|
<td>Toxicity</td> |
|
<td>0.14%</td> |
|
<td>0.15%</td> |
|
<td>0.19%</td> |
|
<td>0.29%</td> |
|
<td>0.39%</td> |
|
<td>0.39%</td> |
|
<td>0.26%</td> |
|
<td>0.18%</td> |
|
<td>0.32%</td> |
|
</tr> |
|
<tr> |
|
<td>Identity Attack</td> |
|
<td>0.04%</td> |
|
<td>0.02%</td> |
|
<td>0.02%</td> |
|
<td>0.13%</td> |
|
<td>0.06%</td> |
|
<td>0.06%</td> |
|
<td>0.06%</td> |
|
<td>0.03%</td> |
|
<td>0.06%</td> |
|
</tr> |
|
<tr> |
|
<td>Insult</td> |
|
<td>0.17%</td> |
|
<td>0.25%</td> |
|
<td>0.17%</td> |
|
<td>0.37%</td> |
|
<td>0.52%</td> |
|
<td>0.52%</td> |
|
<td>0.27%</td> |
|
<td>0.39%</td> |
|
<td>0.24%</td> |
|
</tr> |
|
<tr> |
|
<td>Threat</td> |
|
<td>0.55%</td> |
|
<td>0.43%</td> |
|
<td>0.57%</td> |
|
<td>0.83%</td> |
|
<td>0.48%</td> |
|
<td>0.48%</td> |
|
<td>0.64%</td> |
|
<td>0.43%</td> |
|
<td>0.64%</td> |
|
</tr> |
|
<tr> |
|
<td>Profanity</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th colspan="9" scope="colgroup">Median</th> |
|
</tr> |
|
<tr> |
|
<td>Toxicity</td> |
|
<td>0.13%</td> |
|
<td>0.10%</td> |
|
<td>0.18%</td> |
|
<td>0.07%</td> |
|
<td>0.07%</td> |
|
<td>0.14%</td> |
|
<td>0.12%</td> |
|
<td>0.08%</td> |
|
<td>0.12%</td> |
|
</tr> |
|
<tr> |
|
<td>Identity Attack</td> |
|
<td>0.02%</td> |
|
<td>0.01%</td> |
|
<td>0.02%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Insult</td> |
|
<td>0.15%</td> |
|
<td>0.23%</td> |
|
<td>0.14%</td> |
|
<td>0.14%</td> |
|
<td>0.17%</td> |
|
<td>0.13%</td> |
|
<td>0.09%</td> |
|
<td>0.18%</td> |
|
<td>0.16%</td> |
|
</tr> |
|
<tr> |
|
<td>Threat</td> |
|
<td>0.35%</td> |
|
<td>0.27%</td> |
|
<td>0.41%</td> |
|
<td>0.28%</td> |
|
<td>0.19%</td> |
|
<td>0.42%</td> |
|
<td>0.27%</td> |
|
<td>0.31%</td> |
|
<td>0.40%</td> |
|
</tr> |
|
<tr> |
|
<td>Profanity</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
</table> |
|
## Usage and limitations |
|
|
|
### Intended usage |
|
|
|
Open Vision Language Models (VLMs) have a wide range of applications across |
|
various industries and domains. The following list of potential uses is not |
|
comprehensive. The purpose of this list is to provide contextual information |
|
about the possible use-cases that the model creators considered as part of model |
|
training and development. |
|
|
|
Fine-tuning on a specific vision-language task:
|
|
|
* The pre-trained models can be fine-tuned on a wide range of vision-language

tasks such as image captioning, short video captioning, visual question

answering, text reading, object detection, and object segmentation (a minimal

setup sketch follows this list).

* The pre-trained models can be fine-tuned for specific domains such as remote

sensing question answering, visual questions from people who are blind,

science question answering, and describing UI element functionalities.
|
* The pre-trained models can be fine-tuned for tasks with non-textual outputs |
|
such as bounding boxes or segmentation masks. |
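
As noted above, the following is a minimal sketch of one possible fine-tuning setup using the `transformers` module layout (vision tower, multimodal projector, language model). It is not the `big_vision` transfer recipe used for the official checkpoints; the checkpoint id is a placeholder, and freezing the vision tower is just one common transfer choice.

```python
# Hedged fine-tuning setup sketch (transformers API), not the big_vision
# transfer recipe; checkpoint id is a placeholder.
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-10b-pt-448"  # placeholder pre-trained checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# Freeze the SigLIP vision tower; train the multimodal projector and the
# Gemma 2 language model on the downstream task.
for param in model.vision_tower.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```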
|
Vision-language research: |
|
|
|
* The pre-trained models and fine-tuned models can serve as a foundation for |
|
researchers to experiment with VLM techniques, develop algorithms, and |
|
contribute to the advancement of the field. |
|
### Ethical considerations and risks |
|
|
|
The development of vision-language models (VLMs) raises several ethical |
|
concerns. In creating an open model, we have carefully considered the following: |
|
|
|
* Bias and Fairness |
|
* VLMs trained on large-scale, real-world image-text data can reflect |
|
socio-cultural biases embedded in the training material. These models

underwent careful scrutiny; input data pre-processing is described and

posterior evaluations are reported in this card.
|
* Misinformation and Misuse |
|
* VLMs can be misused to generate text that is false, misleading, or |
|
harmful. |
|
* Guidelines are provided for responsible use with the model, see the |
|
[Responsible Generative AI Toolkit](https://ai.google.dev/responsible). |
|
* Transparency and Accountability |
|
* This model card summarizes details on the models' architecture, |
|
capabilities, limitations, and evaluation processes. |
|
* A responsibly developed open model offers the opportunity to share |
|
innovation by making VLM technology accessible to developers and |
|
researchers across the AI ecosystem. |
|
Risks identified and mitigations: |
|
|
|
* **Perpetuation of biases:** Continuous monitoring (using evaluation metrics

and human review) and the exploration of de-biasing techniques are encouraged

during model training, fine-tuning, and other use cases.
|
* **Generation of harmful content:** Mechanisms and guidelines for content |
|
safety are essential. Developers are encouraged to exercise caution and |
|
implement appropriate content safety safeguards based on their specific |
|
product policies and application use cases. |
|
* **Misuse for malicious purposes:** Technical limitations and developer and

end-user education can help mitigate malicious applications of VLMs.
|
Educational resources and reporting mechanisms for users to flag misuse are |
|
provided: see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible). |
|
Prohibited uses of Gemma models are outlined in the |
|
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). |
|
* **Privacy violations:** Models were trained on data filtered to remove |
|
certain personal information and sensitive data. Developers are encouraged |
|
to adhere to privacy regulations with privacy-preserving techniques. |
|
### Limitations |
|
|
|
* Most limitations inherited from the underlying Gemma 2 models still apply: |
|
* VLMs are better at tasks that can be framed with clear prompts and |
|
instructions. Open-ended or highly complex tasks might be challenging. |
|
* Natural language is inherently complex. VLMs might struggle to grasp |
|
subtle nuances, sarcasm, or figurative language. |
|
* VLMs generate responses based on information they learned from their |
|
training datasets, but they are not knowledge bases. They may generate |
|
incorrect or outdated factual statements. |
|
* VLMs rely on statistical patterns in language and images. They might |
|
lack the ability to apply common sense reasoning in certain situations. |
|
* PaliGemma 2 was designed first and foremost to serve as a general |
|
pre-trained model for fine-tuning to specialized tasks. Hence, its "out of

the box" or "zero-shot" performance might lag behind models designed

specifically for general-purpose use.
|
* PaliGemma 2 is not a multi-turn chatbot. It is designed for a single round |
|
of image and text input. |
|
|
|
[ai2d]: https://allenai.org/data/diagrams |
|
[aokvqa-da]: https://allenai.org/project/a-okvqa/home |
|
[aokvqa-mc]: https://allenai.org/project/a-okvqa/home |
|
[anet-cap]: https://paperswithcode.com/dataset/activitynet-captions |
|
[anet-qa]: https://arxiv.org/abs/1906.02467 |
|
[chartqa]: https://arxiv.org/abs/2203.10244 |
|
[coco-35l]: https://arxiv.org/pdf/2205.12522 |
|
[coco-cap]: https://cocodataset.org/#home |
|
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/ |
|
[docvqa]: https://www.docvqa.org/ |
|
[gqa]: https://cs.stanford.edu/people/dorarad/gqa/about.html |
|
[info-vqa]: https://arxiv.org/abs/2104.12756 |
|
[marvl]: https://marvl-challenge.github.io/ |
|
[msrvtt]: https://paperswithcode.com/dataset/msr-vtt |
|
[msvd-qa]: https://paperswithcode.com/dataset/msvd-qa |
|
[nlvr2]: https://lil.nlp.cornell.edu/nlvr/ |
|
[nocaps]: https://nocaps.org/ |
|
[ocr-vqa]: https://ocr-vqa.github.io/ |
|
[okvqa]: https://okvqa.allenai.org/ |
|
[refcoco]: https://arxiv.org/abs/1608.00272 |
|
[refcoco+]: https://aclanthology.org/D14-1086 |
|
[refcocog]: https://arxiv.org/abs/1511.02283 |
|
[rsvqa-hr]: https://zenodo.org/records/6344367 |
|
[rsvqa-lr]: https://zenodo.org/records/6344334 |
|
[st-vqa]: https://arxiv.org/abs/1905.13648 |
|
[scicap]: https://arxiv.org/abs/2110.11624 |
|
[scienceqa]: https://scienceqa.github.io/ |
|
[screen2words]: https://arxiv.org/abs/2108.03353 |
|
[tallyqa]: https://arxiv.org/abs/1810.12440 |
|
[textcaps]: https://textvqa.org/textcaps/ |
|
[textvqa]: https://textvqa.org/ |
|
[vatex]: https://arxiv.org/abs/1904.03493 |
|
[vizwiz-vqa]: https://vizwiz.org/tasks-and-datasets/vqa/ |
|
[vqav2]: https://visualqa.org/index.html |
|
[xgqa]: https://aclanthology.org/2022.findings-acl.196/ |
|
[xm3600]: https://arxiv.org/pdf/2205.12522 |
|
|
|
[icdar2015-inc]: https://arxiv.org/abs/1511.09207 |
|
[total-text]: https://paperswithcode.com/paper/total-text-a-comprehensive-dataset-for-scene |
|
[fintabnet]: https://developer.ibm.com/data/fintabnet/ |
|
[pubtabnet]: https://paperswithcode.com/dataset/pubtabnet |
|
[grandstaff]: https://link.springer.com/article/10.1007/s10032-023-00432-z |
|
[pubchem]: https://pmc.ncbi.nlm.nih.gov/articles/PMC7352161/ |
|
[docci]: https://research.google/pubs/docci-descriptions-of-connected-and-contrasting-images/ |
|
[mimic-cxr]: https://paperswithcode.com/dataset/mimic-cxr |
|
[vsr]: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00566/116470/Visual-Spatial-Reasoning |