yuxi-liu-wired committed
Commit 386dad3 • 1 Parent(s): 9eedb6b
Update README.md
README.md CHANGED
````diff
@@ -31,6 +31,21 @@ Now, remove the projection matrix. This gives us $g: \text{Image} \to \R^{1024}$
 
 The original paper actually stated that they trained *two* models, and one of them was based on ViT-B, but they did not release it.
 
+The model takes as input real-valued tensors. To preprocess images, use the CLIP preprocessor. That is, use `_, preprocess = clip.load("ViT-L/14")`. Explicitly, the preprocessor performs the following operation:
+
+```python
+def _transform(n_px):
+    return Compose([
+        Resize(n_px, interpolation=BICUBIC),
+        CenterCrop(n_px),
+        _convert_image_to_rgb,
+        ToTensor(),
+        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+    ])
+```
+
+See the documentation for [`CLIPImageProcessor`](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor) for details.
+
 Also, despite the names `style vector` and `content vector`, I have noticed by visual inspection that both are basically equally good for style embedding. I don't know why, but I guess that's life?
 
 ## How to use it
````
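
For reference, a minimal sketch of the two preprocessing routes mentioned in the added text, assuming the OpenAI `clip` package and `transformers` are installed; the image path `example.jpg` is a placeholder, and `openai/clip-vit-large-patch14` is the Hugging Face checkpoint corresponding to ViT-L/14:

```python
from PIL import Image

# Route 1: the OpenAI CLIP package. clip.load returns (model, preprocess);
# preprocess is the _transform pipeline quoted in the diff above.
import clip

_, preprocess = clip.load("ViT-L/14")
pixel_values = preprocess(Image.open("example.jpg")).unsqueeze(0)  # shape (1, 3, 224, 224)

# Route 2: the equivalent Hugging Face image processor.
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
pixel_values = processor(images=Image.open("example.jpg"), return_tensors="pt").pixel_values  # (1, 3, 224, 224)
```

Both routes apply the same resize, center-crop, and normalization, so either output can serve as the real-valued input tensor described above.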