|
--- |
|
library_name: keras-nlp |
|
extra_gated_heading: Access PaliGemma on Hugging Face |
|
extra_gated_prompt: >- |
|
To access PaliGemma on Hugging Face, you’re required to review and agree to |
|
Google’s usage license. To do this, please ensure you’re logged-in to Hugging |
|
Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
license: gemma |
|
pipeline_tag: image-text-to-text |
|
--- |
|
PaliGemma is a set of multi-modal large language models published by Google based on the Gemma model. Both a pre-trained and instruction tuned models are available. See the model card below for benchmarks, data sources, and intended use cases. |
|
|
|
|
|
## Links |
|
|
|
* [PaliGemma API Documentation](https://keras.io/api/keras_nlp/models/pali_gemma/) |
|
* [KerasNLP Beginner Guide](https://keras.io/guides/keras_nlp/getting_started/) |
|
* [KerasNLP Model Publishing Guide](https://keras.io/guides/keras_nlp/upload/) |
|
|
|
## Installation |
|
|
|
Keras and KerasNLP can be installed with: |
|
|
|
``` |
|
pip install -U -q keras-nlp |
|
pip install -U -q keras>=3 |
|
``` |
|
|
|
Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instruction on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page. |
|
|
|
## Presets |
|
|
|
The following model checkpoints are provided by the Keras team. Full code examples for each are available below. |
|
|
|
| Preset name | Parameters | Description | |
|
|-----------------------|------------|-------------------------------------------------------------| |
|
| [paligemma-3b-224-mix-keras](https://huggingface.co/google/paligemma-3b-224-mix-keras) | 2.92B | image size 224, mix fine tuned, text sequence length is 256 | |
|
| [paligemma-3b-448-mix-keras](https://huggingface.co/google/paligemma-3b-448-mix-keras) | 2.92B | image size 448, mix fine tuned, text sequence length is 512 | |
|
| [paligemma-3b-224-keras](https://huggingface.co/google/paligemma-3b-224-keras) | 2.92B | image size 224, pre trained, text sequence length is 128 | |
|
| [paligemma-3b-448-keras](https://huggingface.co/google/paligemma-3b-448-keras) | 2.92B | image size 448, pre trained, text sequence length is 512 | |
|
| [**paligemma-3b-896-keras**](https://huggingface.co/google/paligemma-3b-896-keras) | 2.93B | image size 896, pre trained, text sequence length is 512 | |
|
|
|
## Prompts |
|
|
|
The PaliGemma `"mix"` models can handle a number of prompting structures out of the box. It is important to stick exactly to these prompts, including the newline. Lang can be a language code such as `"en"` or `"fr"`. Support for languages outside of English will vary depending on the prompt type. |
|
|
|
* `"cap {lang}\n"`: very raw short caption (from WebLI-alt). |
|
* `"caption {lang}\n"`: coco-like short captions. |
|
* `"describe {lang}\n"`: somewhat longer more descriptive captions. |
|
* `"ocr\n"`: optical character recognition. |
|
* `"answer en {question}\n"`: question answering about the image contents. |
|
* `"question {lang} {answer}\n"`: question generation for a given answer. |
|
* `"detect {thing} ; {thing}\n"`: count objects in a scene. |
|
|
|
Not `"mix"` presets should be fine-tuned for a specific task. |
|
|
|
``` |
|
!pip install -U -q keras-nlp |
|
``` |
|
Pick a backend of your choice |
|
``` |
|
import os |
|
os.environ["KERAS_BACKEND"] = "jax" |
|
``` |
|
Now we can load the PaliGemma "causal language model" from the Kaggle Models hub. A causal language model is just a LLM that is ready for generation, by training with a causal mask, and running generation a token at a time in a recurrent loop. |
|
``` |
|
keras.config.set_floatx("bfloat16") |
|
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset( |
|
"hf://google/paligemma-3b-896-keras" |
|
) |
|
``` |
|
Function that reads an image from a given URL |
|
``` |
|
def read_image(url): |
|
contents = io.BytesIO(requests.get(url).content) |
|
image = PIL.Image.open(contents) |
|
image = np.array(image) |
|
# Remove alpha channel if neccessary. |
|
if image.shape[2] == 4: |
|
image = image[:, :, :3] |
|
return image |
|
``` |
|
``` |
|
image_url = 'https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png' |
|
image = read_image(image_url) |
|
``` |
|
Use `generate()` call with a single image and prompt. The text prompt |
|
has to end with `\n`. |
|
``` |
|
prompt = 'answer en where is the cow standing?\n' |
|
output = pali_gemma_lm.generate( |
|
inputs={ |
|
"images": image, |
|
"prompts": prompt, |
|
} |
|
) |
|
print(output) |
|
``` |
|
Use `generate()` call with a batched images and prompts. |
|
``` |
|
prompts = [ |
|
'answer en where is the cow standing?\n', |
|
'answer en what color is the cow?\n', |
|
'describe en\n', |
|
'detect cow\n', |
|
'segment cow\n', |
|
] |
|
images = [image, image, image, image, image] |
|
outputs = pali_gemma_lm.generate( |
|
inputs={ |
|
"images": images, |
|
"prompts": prompts, |
|
} |
|
) |
|
for output in outputs: |
|
print(output) |
|
``` |
|
|
|
There's a few other style of prompts this model can handle out of the box... |
|
|
|
`cap {lang}\n`: very raw short caption (from WebLI-alt). |
|
|
|
`caption {lang}\n`: nice, coco-like short captions. |
|
|
|
`describe {lang}\n`: somewhat longer more descriptive captions. |
|
|
|
`ocr\n`: optical character recognition. |
|
|
|
`answer en {question}\n`: question answering about the image contents. |
|
|
|
`question {lang} {answer}\n`: question generation for a given answer. |
|
|
|
`detect {thing} ; {thing}\n`: count objects in a scene. |
|
|
|
Call `fit()` on a single batch |
|
``` |
|
import numpy as np |
|
image = np.random.uniform(-1, 1, size=(224, 224, 3)) |
|
x = { |
|
"images": [image, image], |
|
"prompts": ["answer en Where is the cow standing?\n", "caption en\n"], |
|
} |
|
y = { |
|
"responses": ["beach", "A brown cow standing on a beach next to the ocean."], |
|
} |
|
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset("hf://google/paligemma-3b-896-keras") |
|
pali_gemma_lm.fit(x=x, y=y, batch_size=2) |
|
``` |