library_name: keras-hub
Model Overview
Model Summary
This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural-language supervision: it is trained on a large dataset of image-text pairs to align image and text representations in a shared embedding space. This makes it well suited to tasks such as zero-shot image classification, text-based image retrieval, and other problems that require matching images against text descriptions.
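Zero-shot classification follows directly from that shared embedding space: embed the image, embed one text prompt per candidate label, and pick the prompt whose embedding is closest to the image embedding. The sketch below illustrates the idea with plain NumPy and made-up embedding vectors; it does not call the CLIP API (full KerasHub usage examples follow below).
import numpy as np
# Illustrative only: random stand-in vectors, not real CLIP embeddings.
image_embedding = np.random.randn(512)       # one image embedding
text_embeddings = np.random.randn(3, 512)    # one embedding per candidate label
labels = ["mountains", "cat", "house"]
# After L2-normalization, cosine similarity is just a dot product.
image_embedding /= np.linalg.norm(image_embedding)
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)
similarities = text_embeddings @ image_embedding
print("best matching label:", labels[int(np.argmax(similarities))])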
Weights are released under the MIT License. Keras model code is released under the Apache 2 License.
Links
- CLIP Quickstart Notebook
- CLIP API Documentation
- CLIP Model Card
- KerasHub Beginner Guide
- KerasHub Model Publishing Guide
Installation
Keras and KerasHub can be installed with:
pip install -U -q keras-hub
pip install -U -q keras
JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
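Keras 3 lets you choose which of these backends runs the model. A minimal sketch; the environment variable must be set before keras is imported:
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" / "torch"
import keras
print(keras.backend.backend())  # confirm the active backend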
Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
Preset name | Parameters | Description |
---|---|---|
clip_vit_base_patch16 | 149.62M | The model uses a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 16 and input images of size (224, 224). |
clip_vit_base_patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224). |
clip_vit_large_patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224). |
clip_vit_large_patch14_336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336). |
clip_vit_b_32_laion2b_s34b_b79k | 151.28M | 151 million parameters; 12-layer vision encoder and 12-layer text encoder; patch size of 32. OpenCLIP model. |
clip_vit_h_14_laion2b_s32b_b79k | 986.11M | 986 million parameters; 32-layer vision encoder and 24-layer text encoder; patch size of 14. OpenCLIP model. |
clip_vit_g_14_laion2b_s12b_b42k | 1.37B | 1.4 billion parameters; 40-layer vision encoder and 24-layer text encoder; patch size of 14. OpenCLIP model. |
clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | 2.5 billion parameters; 48-layer vision encoder and 32-layer text encoder; patch size of 14. OpenCLIP model. |
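Any preset name in the table can be passed to from_preset(). As a quick sketch, loading the smallest checkpoint (clip_vit_base_patch32) and inspecting it:
from keras_hub.models import CLIPBackbone
# Load the smallest checkpoint and print its layer and parameter layout.
clip = CLIPBackbone.from_preset("clip_vit_base_patch32")
clip.summary()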
Example Usage
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter
# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("clip_vit_large_patch14_336")
tokenizer = CLIPTokenizer.from_preset(
    "clip_vit_large_patch14_336", sequence_length=5
)
image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14_336")
# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])
# preprocess image and text
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))
# query the model for similarities
clip({
    "images": image,
    "token_ids": tokens,
})
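The call returns similarity logits between the image and each text prompt. Below is a minimal sketch of turning them into per-prompt probabilities; the "image_logits" output key is an assumption here and may differ between keras-hub versions, so check the keys of the returned dictionary.
# Re-run the model, keeping the output this time.
outputs = clip({
    "images": image,
    "token_ids": tokens,
})
# "image_logits" (assumed key name): similarity of the image to each prompt.
probs = keras.ops.softmax(outputs["image_logits"], axis=-1)
probs = keras.ops.convert_to_numpy(probs)[0]
for prompt, p in zip(["mountains", "cat on tortoise", "house"], probs):
    print(f"{prompt}: {p:.3f}")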
Example Usage with Hugging Face URI
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras_hub.models import CLIPBackbone, CLIPTokenizer
from keras_hub.layers import CLIPImageConverter
# instantiate the model and preprocessing tools
clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14_336")
tokenizer = CLIPTokenizer.from_preset("hf://keras/clip_vit_large_patch14_336",
sequence_length=5)
image_converter = CLIPImageConverter.from_preset("hf://keras/clip_vit_large_patch14_336")
# obtain tokens for some input text
tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])
# preprocess image and text
image = keras.utils.load_img("cat.jpg")
image = image_converter(np.array([image]).astype(float))
# query the model for similarities
clip({
    "images": image,
    "token_ids": tokens,
})