Image Classification · AIoT · QNN

Commit 38b40d0 (verified) by qc903113684
1 Parent(s): 460154d

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ CLIP.png filter=lfs diff=lfs merge=lfs -text
+ code/python/CLIP.png filter=lfs diff=lfs merge=lfs -text
CLIP.png ADDED

Git LFS Details

  • SHA256: 308a3ca4503f1c7a07803916c369d78c4ef501e5ab7fc727da9b5e1d2f9ec85b
  • Pointer size: 131 Bytes
  • Size of remote file: 252 kB
README.md ADDED
@@ -0,0 +1,51 @@
+ ---
+ license: other
+ license_name: aplux-model-farm-license
+ license_link: https://aiot.aidlux.com/api/v1/files/license/model_farm_license_en.pdf
+ pipeline_tag: image-classification
+ tags:
+ - AIoT
+ - QNN
+ ---
+
+ ![CLIP](CLIP.png)
+
+ ## Model Details
+
+ The CLIP model was developed by researchers at OpenAI to study what contributes to robustness in computer vision tasks, and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment: to deploy models like CLIP, researchers first need to carefully study their capabilities in relation to the specific context they're being deployed within.
+
+ ### Model Type
+
+ The base model uses a ResNet-50 with several modifications as an image encoder and a masked self-attention Transformer as a text encoder. The two encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model in which the ResNet image encoder is replaced with a Vision Transformer.
+
+ ### Model Versions
+
+ Initially, one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32 was released, along with the RN50 model, which uses an architecture equivalent to ResNet-50.
+
+ As part of the staged release process, the RN101 model was also released, as well as RN50x4, an RN50 scaled up 4x according to the [EfficientNet](https://arxiv.org/abs/1905.11946) scaling rule. In July 2021, the RN50x16 and ViT-B/16 models were additionally released, and in January 2022 the RN50x64 and ViT-L/14 models followed. Lastly, the ViT-L/14@336px model was released in April 2022.
+
+ Please see the paper linked below for further details about their specification.
+
+ ### Source model
+
+ - Input shape: [1x3x224x224], [1x77]
+ - Number of parameters: 82.25M, 60.61M
+ - Model size: 329.00M, 242.44M
+ - Output shape: [1x512], [1x512]
+
+ The source model can be found [here](https://github.com/openai/CLIP).
+
+ ## Performance Reference
+
+ Please search for the model by name in [Model Farm](https://aiot.aidlux.com/en/models).
+
+ ## Inference & Model Conversion
+
+ Please search for the model by name in [Model Farm](https://aiot.aidlux.com/en/models).
+
+ ## License
+
+ - Source Model: [MIT](https://github.com/openai/CLIP/blob/main/LICENSE)
+
+ - Deployable Model: [APLUX-MODEL-FARM-LICENSE](https://aiot.aidlux.com/api/v1/files/license/model_farm_license_en.pdf)
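The contrastive setup described under Model Type reduces zero-shot classification to a similarity-then-softmax computation over L2-normalized embeddings. A minimal sketch with toy 3-d vectors (the real encoders emit 512-d features; the numbers here are made up purely for illustration):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy embeddings: one image, three candidate captions (hypothetical values).
image = l2_normalize([0.9, 0.1, 0.2])
texts = [l2_normalize(t) for t in ([0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [0.2, 0.1, 0.9])]

# Cosine similarity of unit vectors, scaled by a CLIP-style logit scale of 100.
logits = [100.0 * sum(a * b for a, b in zip(image, t)) for t in texts]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 4) for p in probs])
```

The caption whose embedding is most aligned with the image embedding wins; the large logit scale turns small cosine gaps into a near-one-hot distribution.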
code/README.md ADDED
@@ -0,0 +1,43 @@
+ ## Model Information
+
+ ### Source model
+ - Input shape: [1x3x224x224], [1x77]
+ - Number of parameters: 43.02M, 105.16M
+ - Model size: 172.10M, 122.99M
+ - Output shape: [1x512], [1x512]
+
+ Source model repository: [clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16)
+
+ ## Inference with AidLite SDK
+
+ ### SDK installation
+ Model Farm uses the AidLite SDK as the model inference SDK. For details, please refer to the [AidLite developer documentation](https://docs.aidlux.com/software/ai-sdk/aidlite_guide).
+
+ - Install AidLite SDK
+
+ ```bash
+ # Install the appropriate version of the AidLite SDK
+ sudo aid-pkg update
+ sudo aid-pkg install aidlite-sdk
+ # Install the QNN variant that matches the backend above, e.g. for QNN 2.36: sudo aid-pkg install aidlite-qnn236
+ sudo aid-pkg install aidlite-{QNN VERSION}
+ ```
+
+ - Verify AidLite SDK
+
+ ```bash
+ # AidLite SDK C++ library version check
+ python3 -c "import aidlite ; print(aidlite.get_library_version())"
+
+ # AidLite SDK Python library version check
+ python3 -c "import aidlite ; print(aidlite.get_py_library_version())"
+ ```
+
+ ### Run Demo
+ ```bash
+ # Environment setup
+ pip install ftfy packaging regex tqdm Pillow numpy
+ # Run example
+ cd model_farm_clip-vit-16_qcs8550_qnn2.36_fp16_aidlite/python
+ python3 run_test.py
+ ```
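Under the hood, the demo tokenizes each text prompt with CLIP's BPE tokenizer and pads it to the fixed 77-token context window expected by the text encoder's [1x77] input. The padding step can be sketched on its own; the start/end token ids below (49406/49407) are CLIP's published special-token ids, and the prompt ids are made up for illustration:

```python
def pad_to_context(token_ids, sot=49406, eot=49407, context_length=77):
    # Wrap the prompt ids in start-of-text / end-of-text tokens,
    # then zero-pad out to the fixed context length.
    tokens = [sot] + list(token_ids) + [eot]
    if len(tokens) > context_length:
        raise RuntimeError(f"prompt too long for context length {context_length}")
    return tokens + [0] * (context_length - len(tokens))

row = pad_to_context([320, 1929])  # illustrative ids, e.g. for "a dog"
print(len(row), row[:5])
```

The real `tokenize()` in `code/python/utils.py` does the same wrap-and-pad, batched over prompts into an int64 array.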
code/python/CLIP.png ADDED

Git LFS Details

  • SHA256: 308a3ca4503f1c7a07803916c369d78c4ef501e5ab7fc727da9b5e1d2f9ec85b
  • Pointer size: 131 Bytes
  • Size of remote file: 252 kB
code/python/bpe_simple_vocab_16e6.txt.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
+ size 1356917
code/python/run_test.py ADDED
@@ -0,0 +1,86 @@
+ from utils import preprocess, tokenize
+ from PIL import Image
+ import numpy as np
+ import aidlite
+
+
+ def create_model(model_path: str, input_tensor_shape: list, output_tensor_shape: list) -> aidlite.soaidlitesdk.Interpreter:
+     model = aidlite.Model.create_instance(model_path)
+     model.set_model_properties(input_tensor_shape, aidlite.DataType.TYPE_FLOAT32,
+                                output_tensor_shape, aidlite.DataType.TYPE_FLOAT32)
+     config = aidlite.Config.create_instance()
+     config.implement_type = aidlite.ImplementType.TYPE_LOCAL
+     config.framework_type = aidlite.FrameworkType.TYPE_QNN
+     config.accelerate_type = aidlite.AccelerateType.TYPE_DSP
+     config.number_of_threads = 4
+     interpreter = aidlite.InterpreterBuilder.build_interpretper_from_model_and_config(
+         model, config)
+     if interpreter is None:
+         raise RuntimeError("build_interpretper_from_model_and_config failed")
+
+     if interpreter.init() != 0:
+         raise RuntimeError("interpreter init failed")
+
+     if interpreter.load_model() != 0:
+         raise RuntimeError("interpreter load model failed")
+
+     return interpreter
+
+
+ def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
+     x = np.asarray(x, dtype=np.float32)
+     norm = np.linalg.norm(x, axis=axis, keepdims=True)
+     return x / (norm + eps)
+
+
+ def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
+     x = np.asarray(x, dtype=np.float32)
+     x = x - np.max(x, axis=axis, keepdims=True)
+     exp_x = np.exp(x)
+     return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
+
+
+ visual_model_path = "../models/clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin"
+ text_model_path = "../models/clip_text_encoder_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin"
+
+ text_model = create_model(text_model_path, [[1, 77]], [[1, 512]])
+ visual_model = create_model(visual_model_path, [[1, 224, 224, 3]], [[1, 512]])
+
+ # The QNN context binary expects NHWC input; preprocess() returns NCHW, so transpose.
+ visual_model.set_input_tensor("image", preprocess(
+     Image.open("CLIP.png")).transpose(0, 2, 3, 1))
+ visual_model.invoke()
+ image_features = visual_model.get_output_tensor("image_features")
+
+ texts = ["a dog", "a cat", "a diagram"]
+ text_features = []
+ for text in texts:
+     text_input = tokenize(text)
+     text_model.set_input_tensor("text", text_input.astype(np.float32))
+     text_model.invoke()
+     text_out = text_model.get_output_tensor("text_features")
+     text_features.append(np.asarray(text_out, dtype=np.float32).reshape(1, -1))
+
+ visual_model.destory()
+ text_model.destory()
+
+ image_features = np.asarray(image_features, dtype=np.float32).reshape(1, -1)
+ text_features = np.concatenate(text_features, axis=0)  # [num_texts, 512]
+
+ image_features = l2_normalize(image_features, axis=1)
+ text_features = l2_normalize(text_features, axis=1)
+
+ # CLIP's learned logit scale (inverse softmax temperature), ~exp(4.62) here
+ logit_scale = 101.88
+ similarity = logit_scale * (image_features @ text_features.T)  # [1, num_texts]
+ probs = softmax(similarity, axis=1)
+
+ print("texts:", texts)
+ print("similarity matrix (image x texts):")
+ print(similarity)
+ print("probability matrix (image x texts):")
+ print(probs)
+
+ top_idx = int(np.argmax(probs[0]))
+ print("top-1:", texts[top_idx], "prob=", float(probs[0, top_idx]))
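The `logit_scale` used above acts as an inverse softmax temperature: cosine similarities of unit vectors lie in [-1, 1], so without scaling the softmax over them is nearly uniform. A small self-contained illustration with made-up similarity values:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a plain Python list.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sims = [0.31, 0.25, 0.24]                    # raw cosine similarities (made up)
unscaled = softmax(sims)                     # nearly uniform
scaled = softmax([100.0 * s for s in sims])  # with a CLIP-style logit scale
print([round(p, 3) for p in unscaled])
print([round(p, 3) for p in scaled])
```

Multiplying by ~100 before the softmax turns a 0.06 cosine gap into a 6-logit gap, which is why the demo's probability matrix is sharply peaked.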
code/python/utils.py ADDED
@@ -0,0 +1,169 @@
+ import gzip
+ import html
+ import os
+ from functools import lru_cache
+ import ftfy
+ import numpy as np
+ import regex as re
+ from PIL import Image
+
+
+ @lru_cache()
+ def bytes_to_unicode():
+     bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"),
+                                                           ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
+     cs = bs[:]
+     n = 0
+     for b in range(2**8):
+         if b not in bs:
+             bs.append(b)
+             cs.append(2**8 + n)
+             n += 1
+     cs = [chr(n) for n in cs]
+     return dict(zip(bs, cs))
+
+
+ def get_pairs(word):
+     pairs = set()
+     prev_char = word[0]
+     for char in word[1:]:
+         pairs.add((prev_char, char))
+         prev_char = char
+     return pairs
+
+
+ def basic_clean(text):
+     text = ftfy.fix_text(text)
+     text = html.unescape(html.unescape(text))
+     return text.strip()
+
+
+ def whitespace_clean(text):
+     text = re.sub(r"\s+", " ", text)
+     return text.strip()
+
+
+ class SimpleTokenizer(object):
+     def __init__(self, bpe_path: str):
+         self.byte_encoder = bytes_to_unicode()
+         self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+
+         merges = gzip.open(bpe_path, "rb").read().decode("utf-8").split("\n")
+         merges = merges[1:49152 - 256 - 2 + 1]
+         merges = [tuple(merge.split()) for merge in merges]
+
+         vocab = list(bytes_to_unicode().values())
+         vocab = vocab + [v + "</w>" for v in vocab]
+         for merge in merges:
+             vocab.append("".join(merge))
+         vocab.extend(["<|startoftext|>", "<|endoftext|>"])
+         self.encoder = dict(zip(vocab, range(len(vocab))))
+         self.decoder = {v: k for k, v in self.encoder.items()}
+         self.bpe_ranks = dict(zip(merges, range(len(merges))))
+         self.cache = {"<|startoftext|>": "<|startoftext|>",
+                       "<|endoftext|>": "<|endoftext|>"}
+         self.pat = re.compile(
+             r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
+             re.IGNORECASE,
+         )
+
+     def bpe(self, token):
+         if token in self.cache:
+             return self.cache[token]
+         word = tuple(token[:-1]) + (token[-1] + "</w>",)
+         pairs = get_pairs(word)
+         if not pairs:
+             return token + "</w>"
+         while True:
+             bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(
+                 pair, float("inf")))
+             if bigram not in self.bpe_ranks:
+                 break
+             first, second = bigram
+             new_word = []
+             i = 0
+             while i < len(word):
+                 try:
+                     j = word.index(first, i)
+                     new_word.extend(word[i:j])
+                     i = j
+                 except ValueError:
+                     new_word.extend(word[i:])
+                     break
+                 if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
+                     new_word.append(first + second)
+                     i += 2
+                 else:
+                     new_word.append(word[i])
+                     i += 1
+             word = tuple(new_word)
+             if len(word) == 1:
+                 break
+             pairs = get_pairs(word)
+         word = " ".join(word)
+         self.cache[token] = word
+         return word
+
+     def encode(self, text):
+         bpe_tokens = []
+         text = whitespace_clean(basic_clean(text)).lower()
+         for token in re.findall(self.pat, text):
+             token = "".join(self.byte_encoder[b]
+                             for b in token.encode("utf-8"))
+             bpe_tokens.extend(self.encoder[bpe_token]
+                               for bpe_token in self.bpe(token).split(" "))
+         return bpe_tokens
+
+
+ @lru_cache()
+ def default_bpe_path():
+     return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
+
+
+ def tokenize(texts, context_length=77, truncate=False):
+     if isinstance(texts, str):
+         texts = [texts]
+     tokenizer = SimpleTokenizer(default_bpe_path())
+
+     sot_token = tokenizer.encoder["<|startoftext|>"]
+     eot_token = tokenizer.encoder["<|endoftext|>"]
+
+     all_tokens = [[sot_token] +
+                   tokenizer.encode(text) + [eot_token] for text in texts]
+     result = np.zeros((len(all_tokens), context_length), dtype=np.int64)
+
+     for i, tokens in enumerate(all_tokens):
+         if len(tokens) > context_length:
+             if truncate:
+                 tokens = tokens[:context_length]
+                 tokens[-1] = eot_token
+             else:
+                 raise RuntimeError(
+                     f"Input {texts[i]} is too long for context length {context_length}")
+         result[i, :len(tokens)] = np.array(tokens, dtype=np.int64)
+
+     return result
+
+
+ def preprocess(image: Image.Image, image_resolution: int = 224) -> np.ndarray:
+     image = image.convert("RGB")
+     width, height = image.size
+     size = image_resolution
+     if width < height:
+         new_width = size
+         new_height = int(round(size * height / width))
+     else:
+         new_height = size
+         new_width = int(round(size * width / height))
+
+     image = image.resize((new_width, new_height), Image.BICUBIC)
+     left = (new_width - size) // 2
+     top = (new_height - size) // 2
+     image = image.crop((left, top, left + size, top + size))
+
+     arr = np.array(image).astype(np.float32) / 255.0
+     mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
+     std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
+     arr = (arr - mean) / std
+     arr = arr.transpose(2, 0, 1)
+     return arr[np.newaxis, ...].astype(np.float32)
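`preprocess()` above resizes the shorter side to 224 while preserving aspect ratio, then center-crops a 224x224 window. The size arithmetic can be checked on its own without PIL; this sketch mirrors that logic with a hypothetical helper name:

```python
def resize_and_crop_geometry(width, height, size=224):
    # Scale so the shorter side equals `size`, preserving aspect ratio.
    if width < height:
        new_w, new_h = size, int(round(size * height / width))
    else:
        new_h, new_w = size, int(round(size * width / height))
    # Top-left offsets of the centered size x size crop window.
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top)

print(resize_and_crop_geometry(640, 480))   # landscape input
print(resize_and_crop_geometry(480, 640))   # portrait input
```

For a 640x480 input this gives a 299x224 resize and a crop starting 37 px from the left, so the crop always stays inside the resized image.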
models/QCS8550/FP16/clip_text_encoder_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8dedd4fca9b84ccd2bbba8550e0c117bcb3344010ee599d859e81d084bc70e72
+ size 128967808
models/QCS8550/FP16/clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02828fec57588f911dc46687f8482aa4e70065bbfb816840915cd66422e432c3
+ size 180462072
models/qcs8625/FP16/clip_text_encoder_qcs8625_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3fed4f76c8b8ab41d7acf528ccdb5fb9d971652c6c6432303dee9fc3b473b21
+ size 128974216
models/qcs8625/FP16/clip_visual_qcs8625_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cd19e0cc87f6217e7f8d8b9ed17924bf553eccb1d14ca9a7a5a7f3e0987235a2
+ size 179035344
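Each of these `.bin` entries is a Git LFS pointer file: a `version` line, the `oid sha256:…` of the real payload, and its `size` in bytes. After downloading the actual weights, the `oid` can be verified against a local SHA-256; a small sketch (the file path in the commented usage is illustrative):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    # Stream in 1 MiB chunks so large model binaries need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical local path; compare against the oid line above):
# sha256_of_file("clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin")
```

A mismatch between the computed digest and the pointer's `oid` indicates a truncated or corrupted download.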