yangapku committed
Commit 72e6325
1 Parent(s): f029f78

Update README.md

Files changed (1)
  1. README.md +28 -34
README.md CHANGED
@@ -10,42 +10,36 @@ license: apache-2.0
  This is the large version of Chinese CLIP, with ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo https://github.com/OFA-Sys/Chinese-CLIP
 
  ## Use with the official API
- We provide a simple code snippet to show how to use the API for Chinese-CLIP. For starters, please install cn_clip:
- ```bash
- # to install the latest stable release
- pip install cn_clip
-
- # or install from source code
- cd Chinese-CLIP
- pip install -e .
- ```
- After installation, use Chinese CLIP as shown below:
+ We provide a simple code snippet to show how to use the API of Chinese-CLIP to compute the image and text embeddings and their similarities.
+
  ```python
- import torch
  from PIL import Image
-
- import cn_clip.clip as clip
- from cn_clip.clip import load_from_name, available_models
- print("Available models:", available_models())
- # Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
-
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model, preprocess = load_from_name("ViT-L-14-336", device=device, download_root='./')
- model.eval()
- image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
- text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
-
- with torch.no_grad():
-     image_features = model.encode_image(image)
-     text_features = model.encode_text(text)
-     # Normalize the features. Please use the normalized features for downstream tasks.
-     image_features /= image_features.norm(dim=-1, keepdim=True)
-     text_features /= text_features.norm(dim=-1, keepdim=True)
-
-     logits_per_image, logits_per_text = model.get_similarity(image, text)
-     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
-
- print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
+ import requests
+ from transformers import ChineseCLIPProcessor, ChineseCLIPModel
+
+ model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
+ processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
+
+ url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ # Squirtle, Bulbasaur, Charmander, Pikachu in English
+ texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
+
+ # compute image features
+ inputs = processor(images=image, return_tensors="pt")
+ image_features = model.get_image_features(**inputs)
+ image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize
+
+ # compute text features
+ inputs = processor(text=texts, padding=True, return_tensors="pt")
+ text_features = model.get_text_features(**inputs)
+ text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize
+
+ # compute image-text similarity scores
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+ outputs = model(**inputs)
+ logits_per_image = outputs.logits_per_image  # image-text similarity scores
+ probs = logits_per_image.softmax(dim=1)  # probs: [[0.0219, 0.0316, 0.0043, 0.9423]]
  ```
 
  However, if you are not satisfied with only using the API, feel free to check our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.
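
The probabilities at the end of the updated snippet line up one-to-one with `texts`, so the best-matching caption can be read out with a single `argmax`. The sketch below is only an illustrative continuation of the `transformers`-based example, not part of the commit: it reuses `model`, `texts`, `image_features`, `text_features`, and `probs` from that snippet and assumes the model exposes a CLIP-style `logit_scale` parameter.

```python
# Illustrative continuation of the transformers-based snippet above (not part of the commit).
# Assumes `model`, `texts`, `image_features`, `text_features`, and `probs` are still in scope.
import torch

# Pick the caption with the highest image-text probability.
best_idx = probs.argmax(dim=1).item()
print("Predicted label:", texts[best_idx])  # expected: 皮卡丘 (Pikachu)

# The same scores can be recovered from the normalized embeddings: cosine similarities
# scaled by the model's learned temperature (assumed CLIP-style `logit_scale` attribute).
with torch.no_grad():
    logit_scale = model.logit_scale.exp()
    logits = logit_scale * (image_features @ text_features.t())
    print("Probs from embeddings:", logits.softmax(dim=1))
```

Recomputing the scores from cached embeddings like this is convenient for retrieval-style use, where the same image or text features are compared against many candidates without re-running the encoders.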