|
---
library_name: keras-hub
---

### Model Overview

# Model Summary

This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from natural language supervision: it is trained on a large dataset of image-text pairs so that matching image and text representations are aligned in a shared embedding space. This makes it effective at zero-shot image classification, text-based image retrieval, and other visual understanding tasks.
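
To make the shared embedding space concrete, here is a minimal sketch that scores randomly generated placeholder embeddings (not real CLIP outputs) by cosine similarity, the quantity the contrastive objective drives up for matching image-text pairs:

```
import keras
from keras import ops

# Placeholder embeddings standing in for CLIP's image and text encoder outputs:
# 2 images and 3 text prompts, each mapped to a 512-dimensional vector.
image_embeds = keras.random.normal((2, 512))
text_embeds = keras.random.normal((3, 512))

# L2-normalize so that dot products become cosine similarities.
image_embeds = image_embeds / ops.sqrt(ops.sum(image_embeds**2, axis=-1, keepdims=True))
text_embeds = text_embeds / ops.sqrt(ops.sum(text_embeds**2, axis=-1, keepdims=True))

# Similarity matrix: entry (i, j) scores image i against prompt j. Contrastive
# training pushes matching pairs toward high similarity, mismatches toward low.
similarity = ops.matmul(image_embeds, ops.transpose(text_embeds))
best_prompt = ops.argmax(similarity, axis=-1)  # most similar prompt per image
```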
|
|
|
|
|
Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

## Links

* [CLIP Quickstart Notebook](https://www.kaggle.com/code/divyasss/clip-quickstart-single-shot-classification)
* [CLIP API Documentation](https://keras.io/api/keras_cv/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
|
|
|
## Installation

Keras and KerasCV can be installed with:

```
pip install -U -q keras-cv
pip install -U -q "keras>=3"
```
|
|
|
JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
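
Outside Kaggle, you can choose which of these frameworks Keras 3 runs on by setting the `KERAS_BACKEND` environment variable before Keras is first imported. A minimal sketch, using JAX as the example backend:

```
import os

# Select the Keras 3 backend before the first `import keras`; "jax" is one option.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", "torch"

import keras

print(keras.backend.backend())  # confirm which backend is active
```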
|
|
|
## Presets

The following model checkpoints are provided by the Keras team. A full code example is provided below.

| Preset name | Parameters | Description |
|----------------------------|------------|-------------|
| clip-vit-base-patch16 | 149.62M | Uses a ViT-B/16 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 16, input images of size (224, 224). |
| clip-vit-base-patch32 | 151.28M | Uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 32, input images of size (224, 224). |
| clip-vit-large-patch14 | 427.62M | Uses a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input images of size (224, 224). |
| clip-vit-large-patch14-336 | 427.94M | Uses a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input images of size (336, 336). |
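
Any of these presets can be passed by name to `CLIP.from_preset`, as in the full example further below. A short sketch loading the 336-pixel ViT-L/14 variant (remember to preprocess images to (336, 336) for this checkpoint):

```
from keras_cv.models import CLIP

# The -336 preset expects (336, 336) inputs; the other presets expect (224, 224).
model = CLIP.from_preset("clip-vit-large-patch14-336")
```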
|
|
|
## Example code

```
import keras
from keras import ops
from keras_cv.models import CLIP
from keras_cv.models.feature_extractor.clip import CLIPProcessor


# Optional image preprocessing: resize, center-crop, and normalize with CLIP's
# training statistics, then add a batch dimension.
def transform_image(image_path, input_resolution):
    mean = ops.array([0.48145466, 0.4578275, 0.40821073])
    std = ops.array([0.26862954, 0.26130258, 0.27577711])

    image = keras.utils.load_img(image_path)
    image = keras.utils.img_to_array(image)
    image = (
        ops.image.resize(
            image,
            (input_resolution, input_resolution),
            interpolation="bicubic",
        )
        / 255.0
    )
    central_fraction = input_resolution / image.shape[0]
    width, height = image.shape[0], image.shape[1]
    left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
    top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
    right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
    bottom = ops.cast(
        (height + height * central_fraction) / 2, dtype="int32"
    )
    image = ops.slice(
        image, [left, top, 0], [right - left, bottom - top, 3]
    )
    image = (image - mean) / std
    return ops.expand_dims(image, axis=0)


# Tokenize the text prompts and preprocess the input image.
processor = CLIPProcessor("vocab.json", "merges.txt")
tokens = processor(["mountains", "cat on tortoise", "house"])
processed_image = transform_image("cat.jpg", 224)

# Load a pretrained preset and run it on the image/text batch.
model = CLIP.from_preset("clip-vit-base-patch32")
output = model({
    "images": processed_image,
    "token_ids": tokens["token_ids"],
    "padding_mask": tokens["padding_mask"],
})
```
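
To turn the output into zero-shot probabilities over the three prompts, a minimal sketch follows. It assumes `output` is, or contains as its first element, an image-to-text logits tensor of shape `(num_images, num_prompts)`; the exact return structure may differ between KerasCV versions, so inspect `output` before relying on this.

```
from keras import ops

# Assumption: the model output is (or starts with) image-to-text logits of
# shape (num_images, num_prompts). Verify this for your KerasCV version.
image_logits = output[0] if isinstance(output, (tuple, list)) else output

probs = ops.convert_to_numpy(ops.softmax(image_logits, axis=-1))
for prompt, p in zip(["mountains", "cat on tortoise", "house"], probs[0]):
    print(f"{prompt}: {p:.3f}")
```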
|
|
|
|