Divyasreepat committed on
Commit
836eb0d
1 Parent(s): 1203cbb

Update README.md with new model card content

Files changed (1)
  1. README.md +84 -9
README.md CHANGED
@@ -1,12 +1,87 @@
  ---
  library_name: keras-hub
  ---
- This is a [`CLIP` model](https://keras.io/api/keras_hub/models/clip) uploaded using the KerasHub library and can be used with JAX, TensorFlow, and PyTorch backends.
- Model config:
- * **name:** clip_backbone
- * **trainable:** True
- * **vision_encoder:** {'module': 'keras_hub.src.models.clip.clip_vision_encoder', 'class_name': 'CLIPVisionEncoder', 'config': {'name': 'clip_vision_encoder', 'trainable': True, 'patch_size': 32, 'hidden_dim': 768, 'num_layers': 12, 'num_heads': 12, 'intermediate_dim': 3072, 'intermediate_activation': 'quick_gelu', 'intermediate_output_index': None, 'image_shape': [224, 224, 3]}, 'registered_name': 'keras_hub>CLIPVisionEncoder'}
- * **text_encoder:** {'module': 'keras_hub.src.models.clip.clip_text_encoder', 'class_name': 'CLIPTextEncoder', 'config': {'name': 'clip_text_encoder', 'trainable': True, 'vocabulary_size': 49408, 'embedding_dim': 512, 'hidden_dim': 512, 'num_layers': 12, 'num_heads': 8, 'intermediate_dim': 2048, 'intermediate_activation': 'quick_gelu', 'intermediate_output_index': None, 'max_sequence_length': 77}, 'registered_name': 'keras_hub>CLIPTextEncoder'}
- * **projection_dim:** 512
-
- This model card has been generated automatically and should be completed by the model author. See [Model Cards documentation](https://huggingface.co/docs/hub/model-cards) for more information.
+ ### Model Overview
+
+ This model is a CLIP (Contrastive Language-Image Pre-training) neural network. CLIP learns visual concepts from the natural language descriptions that accompany images online: it is trained on a large dataset of image-text pairs, which makes it effective at zero-shot image classification, text-driven image search, and robust visual understanding. CLIP does this by aligning image and text representations within a shared embedding space.
+
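+ To make the shared embedding space concrete, here is a minimal, self-contained sketch that uses random placeholder vectors in place of real CLIP embeddings: zero-shot classification is just cosine similarity between one image embedding and several text embeddings, followed by a softmax.
+
+ ```
+ import numpy as np
+ from keras import ops
+
+ # Placeholder embeddings standing in for real CLIP outputs:
+ # one image embedding and three text embeddings, 512-dimensional each.
+ image_embedding = np.random.rand(1, 512).astype("float32")
+ text_embeddings = np.random.rand(3, 512).astype("float32")
+
+ # L2-normalize so that a dot product is the cosine similarity.
+ image_embedding /= np.linalg.norm(image_embedding, axis=-1, keepdims=True)
+ text_embeddings /= np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
+ similarity = ops.matmul(image_embedding, ops.transpose(text_embeddings))
+
+ # A softmax over the candidate texts turns similarities into zero-shot "probabilities".
+ print(ops.softmax(similarity, axis=-1))
+ ```
+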
+ Weights are released under the [MIT License](https://opensource.org/license/mit). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).
+
+ ## Links
+
+ * [CLIP Quickstart Notebook](https://www.kaggle.com/code/divyasss/clip-quickstart-single-shot-classification)
+ * [CLIP API Documentation](https://keras.io/api/keras_cv/models/clip/)
+ * [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
+
+ ## Installation
+
+ Keras and KerasCV can be installed with:
+
+ ```
+ pip install -U -q keras-cv
+ pip install -U -q "keras>=3"
+ ```
+
+ JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
+
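+ Keras 3 selects its backend from the `KERAS_BACKEND` environment variable, which must be set before `keras` is first imported. A minimal sketch, using `"jax"` as just one of the supported values:
+
+ ```
+ import os
+
+ # Must run before the first `import keras`; valid values are "jax", "tensorflow", "torch".
+ os.environ["KERAS_BACKEND"] = "jax"
+
+ import keras
+
+ print(keras.backend.backend())  # -> "jax"
+ ```
+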
+ ## Presets
+
+ The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
+
+ | Preset name                | Parameters | Description |
+ |----------------------------|------------|-------------|
+ | clip-vit-base-patch16      | 149.62M    | ViT-B/16 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 16, input image size (224, 224). |
+ | clip-vit-base-patch32      | 151.28M    | ViT-B/32 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 32, input image size (224, 224). |
+ | clip-vit-large-patch14     | 427.62M    | ViT-L/14 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input image size (224, 224). |
+ | clip-vit-large-patch14-336 | 427.94M    | ViT-L/14 Transformer image encoder with a masked self-attention Transformer text encoder, trained to maximize the similarity of (image, text) pairs via a contrastive loss. Patch size 14, input image size (336, 336). |
+
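+ Any preset in the table can also be loaded directly by name with `from_preset`; the sketch below assumes the same KerasCV `CLIP` class used in the example code that follows.
+
+ ```
+ from keras_cv.models import CLIP
+
+ # Load one of the presets listed above; weights are downloaded on first use.
+ model = CLIP.from_preset("clip-vit-large-patch14")
+
+ # Should roughly match the parameter count listed in the table above.
+ print(model.count_params())
+ ```
+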
+ ## Example code
+
+ The snippet below tokenizes a few candidate captions, preprocesses an input image with an optional helper, and runs both through a pretrained CLIP preset.
+
+ ```
+ import keras
+ from keras import ops
+ from keras_cv.models import CLIP
+ from keras_cv.models.feature_extractor.clip import CLIPProcessor
+
+
+ # Optional helper: resize, center-crop, and normalize an image the way CLIP
+ # expects. Swap in your own preprocessing if you prefer.
+ def transform_image(image_path, input_resolution):
+     mean = ops.array([0.48145466, 0.4578275, 0.40821073])
+     std = ops.array([0.26862954, 0.26130258, 0.27577711])
+
+     image = keras.utils.load_img(image_path)
+     image = keras.utils.img_to_array(image)
+     image = (
+         ops.image.resize(
+             image,
+             (input_resolution, input_resolution),
+             interpolation="bicubic",
+         )
+         / 255.0
+     )
+     central_fraction = input_resolution / image.shape[0]
+     width, height = image.shape[0], image.shape[1]
+     left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
+     top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
+     right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
+     bottom = ops.cast(
+         (height + height * central_fraction) / 2, dtype="int32"
+     )
+
+     image = ops.slice(
+         image, [left, top, 0], [right - left, bottom - top, 3]
+     )
+
+     image = (image - mean) / std
+     return ops.expand_dims(image, axis=0)
+
+
+ # Tokenize the candidate texts and preprocess the input image.
+ processor = CLIPProcessor("vocab.json", "merges.txt")
+ tokens = processor(["mountains", "cat on tortoise", "house"])
+ processed_image = transform_image("cat.jpg", 224)
+
+ # Load a pretrained preset and run image and text through the model together.
+ model = CLIP.from_preset("clip-vit-base-patch32")
+ output = model({
+     "images": processed_image,
+     "token_ids": tokens["token_ids"],
+     "padding_mask": tokens["padding_mask"],
+ })
+ ```
+
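+ What `output` contains depends on the installed KerasCV version, so the follow-up below is only a sketch: it assumes the call returns image-to-text logits of shape `(num_images, num_texts)`, either directly or as the first element of an `(image_logits, text_logits)` pair. Under that assumption, a softmax over the candidate texts yields zero-shot classification probabilities.
+
+ ```
+ from keras import ops
+
+ # Assumption: `output` holds image-to-text logits, possibly wrapped in a tuple/list.
+ image_logits = output[0] if isinstance(output, (tuple, list)) else output
+
+ # Softmax over the candidate texts ("mountains", "cat on tortoise", "house").
+ probs = ops.softmax(image_logits, axis=-1)
+ print(probs)  # the highest probability marks the best-matching caption
+ ```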