Image Classification · AIoT · QNN

Commit 38b40d0 (verified) by qc903113684
1 Parent(s): 460154d

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ CLIP.png filter=lfs diff=lfs merge=lfs -text
+ code/python/CLIP.png filter=lfs diff=lfs merge=lfs -text
CLIP.png ADDED

Git LFS Details

  • SHA256: 308a3ca4503f1c7a07803916c369d78c4ef501e5ab7fc727da9b5e1d2f9ec85b
  • Pointer size: 131 Bytes
  • Size of remote file: 252 kB
README.md ADDED
@@ -0,0 +1,51 @@
+ ---
+ license: other
+ license_name: aplux-model-farm-license
+ license_link: https://aiot.aidlux.com/api/v1/files/license/model_farm_license_en.pdf
+ pipeline_tag: image-classification
+ tags:
+ - AIoT
+ - QNN
+ ---
+
+ ![CLIP](CLIP.png)
+
+ ## Model Details
+
+ The CLIP model was developed by researchers at OpenAI to study what contributes to robustness in computer vision tasks, and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment: to deploy models like CLIP, researchers first need to carefully study their capabilities in relation to the specific context they're being deployed within.
+
+ ### Model Type
+
+ The base model uses a ResNet-50 with several modifications as an image encoder and a masked self-attention Transformer as a text encoder. The two encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model in which the ResNet image encoder is replaced with a Vision Transformer.
+
+ ### Model Versions
+
+ Initially, one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32 was released, along with the RN50 model, which uses an architecture equivalent to ResNet-50.
+
+ As part of the staged release process, the RN101 model was also released, as well as RN50x4, an RN50 scaled up 4x according to the [EfficientNet](https://arxiv.org/abs/1905.11946) scaling rule. In July 2021, the RN50x16 and ViT-B/16 models were additionally released, and in January 2022 the RN50x64 and ViT-L/14 models followed. Lastly, the ViT-L/14@336px model was released in April 2022.
+
+ Please see the paper linked below for further details about their specification.
+
+ ### Source model
+
+ - Input shape: [1x3x224x224], [1x77]
+ - Number of parameters: 82.25M, 60.61M
+ - Model size: 329.00M, 242.44M
+ - Output shape: [1x512], [1x512]
+
+ The source model can be found [here](https://github.com/openai/CLIP).
+
+ ## Performance Reference
+
+ Please search for the model by name in [Model Farm](https://aiot.aidlux.com/en/models).
+
+ ## Inference & Model Conversion
+
+ Please search for the model by name in [Model Farm](https://aiot.aidlux.com/en/models).
+
+ ## License
+
+ - Source Model: [MIT](https://github.com/openai/CLIP/blob/main/LICENSE)
+
+ - Deployable Model: [APLUX-MODEL-FARM-LICENSE](https://aiot.aidlux.com/api/v1/files/license/model_farm_license_en.pdf)
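The contrastive setup described under Model Type reduces zero-shot classification to a similarity-then-softmax computation over L2-normalized embeddings. A minimal sketch with toy 3-d vectors (the real encoders emit 512-d features; the numbers here are made up purely for illustration):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy embeddings: one image, three candidate captions (hypothetical values).
image = l2_normalize([0.9, 0.1, 0.2])
texts = [l2_normalize(t) for t in ([0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [0.2, 0.1, 0.9])]

# Cosine similarity of unit vectors, scaled by a CLIP-style logit scale of 100.
logits = [100.0 * sum(a * b for a, b in zip(image, t)) for t in texts]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 4) for p in probs])
```

The caption whose embedding is most aligned with the image embedding wins; the large logit scale turns small cosine gaps into a near-one-hot distribution.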
code/README.md ADDED
@@ -0,0 +1,43 @@
+ ## Model Information
+
+ ### Source model
+ - Input shape: [1x3x224x224], [1x77]
+ - Number of parameters: 43.02M, 105.16M
+ - Model size: 172.10M, 122.99M
+ - Output shape: [1x512], [1x512]
+
+ Source model repository: [clip-vit-base-patch16](https://huggingface.co/openai/clip-vit-base-patch16)
+
+ ## Inference with AidLite SDK
+
+ ### SDK installation
+ Model Farm uses the AidLite SDK as the model inference SDK. For details, please refer to the [AidLite developer documentation](https://docs.aidlux.com/software/ai-sdk/aidlite_guide).
+
+ - Install AidLite SDK
+
+ ```bash
+ # Install the appropriate version of the AidLite SDK
+ sudo aid-pkg update
+ sudo aid-pkg install aidlite-sdk
+ # Install the QNN variant that matches the backend above, e.g. for QNN 2.36: sudo aid-pkg install aidlite-qnn236
+ sudo aid-pkg install aidlite-{QNN VERSION}
+ ```
+
+ - Verify AidLite SDK
+
+ ```bash
+ # AidLite SDK C++ library version check
+ python3 -c "import aidlite ; print(aidlite.get_library_version())"
+
+ # AidLite SDK Python library version check
+ python3 -c "import aidlite ; print(aidlite.get_py_library_version())"
+ ```
+
+ ### Run Demo
+ ```bash
+ # Environment setup
+ pip install ftfy packaging regex tqdm Pillow numpy
+ # Run example
+ cd model_farm_clip-vit-16_qcs8550_qnn2.36_fp16_aidlite/python
+ python3 run_test.py
+ ```
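Under the hood, the demo tokenizes each text prompt with CLIP's BPE tokenizer and pads it to the fixed 77-token context window expected by the text encoder's [1x77] input. The padding step can be sketched on its own; the start/end token ids below (49406/49407) are CLIP's published special-token ids, and the prompt ids are made up for illustration:

```python
def pad_to_context(token_ids, sot=49406, eot=49407, context_length=77):
    # Wrap the prompt ids in start-of-text / end-of-text tokens,
    # then zero-pad out to the fixed context length.
    tokens = [sot] + list(token_ids) + [eot]
    if len(tokens) > context_length:
        raise RuntimeError(f"prompt too long for context length {context_length}")
    return tokens + [0] * (context_length - len(tokens))

row = pad_to_context([320, 1929])  # illustrative ids, e.g. for "a dog"
print(len(row), row[:5])
```

The real `tokenize()` in `code/python/utils.py` does the same wrap-and-pad, batched over prompts into an int64 array.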
code/python/CLIP.png ADDED

Git LFS Details

  • SHA256: 308a3ca4503f1c7a07803916c369d78c4ef501e5ab7fc727da9b5e1d2f9ec85b
  • Pointer size: 131 Bytes
  • Size of remote file: 252 kB
code/python/bpe_simple_vocab_16e6.txt.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
+ size 1356917
code/python/run_test.py ADDED
@@ -0,0 +1,86 @@
+ from utils import preprocess, tokenize
+ from PIL import Image
+ import numpy as np
+ import aidlite
+
+
+ def create_model(model_path: str, input_tensor_shape: list, output_tensor_shape: list) -> aidlite.soaidlitesdk.Interpreter:
+     model = aidlite.Model.create_instance(model_path)
+     model.set_model_properties(input_tensor_shape, aidlite.DataType.TYPE_FLOAT32,
+                                output_tensor_shape, aidlite.DataType.TYPE_FLOAT32)
+     config = aidlite.Config.create_instance()
+     config.implement_type = aidlite.ImplementType.TYPE_LOCAL
+     config.framework_type = aidlite.FrameworkType.TYPE_QNN
+     config.accelerate_type = aidlite.AccelerateType.TYPE_DSP
+     config.number_of_threads = 4
+     interpreter = aidlite.InterpreterBuilder.build_interpretper_from_model_and_config(
+         model, config)
+     if interpreter is None:
+         raise RuntimeError("build_interpretper_from_model_and_config failed")
+
+     if interpreter.init() != 0:
+         raise RuntimeError("interpreter init failed")
+
+     if interpreter.load_model() != 0:
+         raise RuntimeError("interpreter load model failed")
+
+     return interpreter
+
+
+ def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
+     x = np.asarray(x, dtype=np.float32)
+     norm = np.linalg.norm(x, axis=axis, keepdims=True)
+     return x / (norm + eps)
+
+
+ def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
+     x = np.asarray(x, dtype=np.float32)
+     x = x - np.max(x, axis=axis, keepdims=True)
+     exp_x = np.exp(x)
+     return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
+
+
+ visual_model_path = "../models/clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin"
+ text_model_path = "../models/clip_text_encoder_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin"
+
+ text_model = create_model(text_model_path, [[1, 77]], [[1, 512]])
+ visual_model = create_model(visual_model_path, [[1, 224, 224, 3]], [[1, 512]])
+
+ # The QNN context binary expects NHWC input; preprocess() returns NCHW, so transpose.
+ visual_model.set_input_tensor("image", preprocess(
+     Image.open("CLIP.png")).transpose(0, 2, 3, 1))
+ visual_model.invoke()
+ image_features = visual_model.get_output_tensor("image_features")
+
+ texts = ["a dog", "a cat", "a diagram"]
+ text_features = []
+ for text in texts:
+     text_input = tokenize(text)
+     text_model.set_input_tensor("text", text_input.astype(np.float32))
+     text_model.invoke()
+     text_out = text_model.get_output_tensor("text_features")
+     text_features.append(np.asarray(text_out, dtype=np.float32).reshape(1, -1))
+
+ visual_model.destory()
+ text_model.destory()
+
+ image_features = np.asarray(image_features, dtype=np.float32).reshape(1, -1)
+ text_features = np.concatenate(text_features, axis=0)  # [num_texts, 512]
+
+ image_features = l2_normalize(image_features, axis=1)
+ text_features = l2_normalize(text_features, axis=1)
+
+ # CLIP's learned logit scale (inverse softmax temperature), ~exp(4.62) here
+ logit_scale = 101.88
+ similarity = logit_scale * (image_features @ text_features.T)  # [1, num_texts]
+ probs = softmax(similarity, axis=1)
+
+ print("texts:", texts)
+ print("similarity matrix (image x texts):")
+ print(similarity)
+ print("probability matrix (image x texts):")
+ print(probs)
+
+ top_idx = int(np.argmax(probs[0]))
+ print("top-1:", texts[top_idx], "prob=", float(probs[0, top_idx]))
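The `logit_scale` used above acts as an inverse softmax temperature: cosine similarities of unit vectors lie in [-1, 1], so without scaling the softmax over them is nearly uniform. A small self-contained illustration with made-up similarity values:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a plain Python list.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sims = [0.31, 0.25, 0.24]                    # raw cosine similarities (made up)
unscaled = softmax(sims)                     # nearly uniform
scaled = softmax([100.0 * s for s in sims])  # with a CLIP-style logit scale
print([round(p, 3) for p in unscaled])
print([round(p, 3) for p in scaled])
```

Multiplying by ~100 before the softmax turns a 0.06 cosine gap into a 6-logit gap, which is why the demo's probability matrix is sharply peaked.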
code/python/utils.py ADDED
@@ -0,0 +1,169 @@
+ import gzip
+ import html
+ import os
+ from functools import lru_cache
+ import ftfy
+ import numpy as np
+ import regex as re
+ from PIL import Image
+
+
+ @lru_cache()
+ def bytes_to_unicode():
+     bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"),
+                                                           ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
+     cs = bs[:]
+     n = 0
+     for b in range(2**8):
+         if b not in bs:
+             bs.append(b)
+             cs.append(2**8 + n)
+             n += 1
+     cs = [chr(n) for n in cs]
+     return dict(zip(bs, cs))
+
+
+ def get_pairs(word):
+     pairs = set()
+     prev_char = word[0]
+     for char in word[1:]:
+         pairs.add((prev_char, char))
+         prev_char = char
+     return pairs
+
+
+ def basic_clean(text):
+     text = ftfy.fix_text(text)
+     text = html.unescape(html.unescape(text))
+     return text.strip()
+
+
+ def whitespace_clean(text):
+     text = re.sub(r"\s+", " ", text)
+     return text.strip()
+
+
+ class SimpleTokenizer(object):
+     def __init__(self, bpe_path: str):
+         self.byte_encoder = bytes_to_unicode()
+         self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+
+         merges = gzip.open(bpe_path, "rb").read().decode("utf-8").split("\n")
+         merges = merges[1:49152 - 256 - 2 + 1]
+         merges = [tuple(merge.split()) for merge in merges]
+
+         vocab = list(bytes_to_unicode().values())
+         vocab = vocab + [v + "</w>" for v in vocab]
+         for merge in merges:
+             vocab.append("".join(merge))
+         vocab.extend(["<|startoftext|>", "<|endoftext|>"])
+         self.encoder = dict(zip(vocab, range(len(vocab))))
+         self.decoder = {v: k for k, v in self.encoder.items()}
+         self.bpe_ranks = dict(zip(merges, range(len(merges))))
+         self.cache = {"<|startoftext|>": "<|startoftext|>",
+                       "<|endoftext|>": "<|endoftext|>"}
+         self.pat = re.compile(
+             r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
+             re.IGNORECASE,
+         )
+
+     def bpe(self, token):
+         if token in self.cache:
+             return self.cache[token]
+         word = tuple(token[:-1]) + (token[-1] + "</w>",)
+         pairs = get_pairs(word)
+         if not pairs:
+             return token + "</w>"
+         while True:
+             bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(
+                 pair, float("inf")))
+             if bigram not in self.bpe_ranks:
+                 break
+             first, second = bigram
+             new_word = []
+             i = 0
+             while i < len(word):
+                 try:
+                     j = word.index(first, i)
+                     new_word.extend(word[i:j])
+                     i = j
+                 except ValueError:
+                     new_word.extend(word[i:])
+                     break
+                 if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
+                     new_word.append(first + second)
+                     i += 2
+                 else:
+                     new_word.append(word[i])
+                     i += 1
+             word = tuple(new_word)
+             if len(word) == 1:
+                 break
+             pairs = get_pairs(word)
+         word = " ".join(word)
+         self.cache[token] = word
+         return word
+
+     def encode(self, text):
+         bpe_tokens = []
+         text = whitespace_clean(basic_clean(text)).lower()
+         for token in re.findall(self.pat, text):
+             token = "".join(self.byte_encoder[b]
+                             for b in token.encode("utf-8"))
+             bpe_tokens.extend(self.encoder[bpe_token]
+                               for bpe_token in self.bpe(token).split(" "))
+         return bpe_tokens
+
+
+ @lru_cache()
+ def default_bpe_path():
+     return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
+
+
+ def tokenize(texts, context_length=77, truncate=False):
+     if isinstance(texts, str):
+         texts = [texts]
+     tokenizer = SimpleTokenizer(default_bpe_path())
+
+     sot_token = tokenizer.encoder["<|startoftext|>"]
+     eot_token = tokenizer.encoder["<|endoftext|>"]
+
+     all_tokens = [[sot_token] +
+                   tokenizer.encode(text) + [eot_token] for text in texts]
+     result = np.zeros((len(all_tokens), context_length), dtype=np.int64)
+
+     for i, tokens in enumerate(all_tokens):
+         if len(tokens) > context_length:
+             if truncate:
+                 tokens = tokens[:context_length]
+                 tokens[-1] = eot_token
+             else:
+                 raise RuntimeError(
+                     f"Input {texts[i]} is too long for context length {context_length}")
+         result[i, :len(tokens)] = np.array(tokens, dtype=np.int64)
+
+     return result
+
+
+ def preprocess(image: Image.Image, image_resolution: int = 224) -> np.ndarray:
+     image = image.convert("RGB")
+     width, height = image.size
+     size = image_resolution
+     if width < height:
+         new_width = size
+         new_height = int(round(size * height / width))
+     else:
+         new_height = size
+         new_width = int(round(size * width / height))
+
+     image = image.resize((new_width, new_height), Image.BICUBIC)
+     left = (new_width - size) // 2
+     top = (new_height - size) // 2
+     image = image.crop((left, top, left + size, top + size))
+
+     arr = np.array(image).astype(np.float32) / 255.0
+     mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
+     std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
+     arr = (arr - mean) / std
+     arr = arr.transpose(2, 0, 1)
+     return arr[np.newaxis, ...].astype(np.float32)
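`preprocess()` above resizes the shorter side to 224 while preserving aspect ratio, then center-crops a 224x224 window. The size arithmetic can be checked on its own without PIL; this sketch mirrors that logic with a hypothetical helper name:

```python
def resize_and_crop_geometry(width, height, size=224):
    # Scale so the shorter side equals `size`, preserving aspect ratio.
    if width < height:
        new_w, new_h = size, int(round(size * height / width))
    else:
        new_h, new_w = size, int(round(size * width / height))
    # Top-left offsets of the centered size x size crop window.
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top)

print(resize_and_crop_geometry(640, 480))   # landscape input
print(resize_and_crop_geometry(480, 640))   # portrait input
```

For a 640x480 input this gives a 299x224 resize and a crop starting 37 px from the left, so the crop always stays inside the resized image.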
models/QCS8550/FP16/clip_text_encoder_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8dedd4fca9b84ccd2bbba8550e0c117bcb3344010ee599d859e81d084bc70e72
+ size 128967808
models/QCS8550/FP16/clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02828fec57588f911dc46687f8482aa4e70065bbfb816840915cd66422e432c3
+ size 180462072
models/qcs8625/FP16/clip_text_encoder_qcs8625_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3fed4f76c8b8ab41d7acf528ccdb5fb9d971652c6c6432303dee9fc3b473b21
+ size 128974216
models/qcs8625/FP16/clip_visual_qcs8625_fp16.qnn236.ctx.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cd19e0cc87f6217e7f8d8b9ed17924bf553eccb1d14ca9a7a5a7f3e0987235a2
+ size 179035344
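Each of these `.bin` entries is a Git LFS pointer file: a `version` line, the `oid sha256:…` of the real payload, and its `size` in bytes. After downloading the actual weights, the `oid` can be verified against a local SHA-256; a small sketch (the file path in the commented usage is illustrative):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    # Stream in 1 MiB chunks so large model binaries need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical local path; compare against the oid line above):
# sha256_of_file("clip_visual_ViT-B_16_qcs8550_fp16.qnn236.ctx.bin")
```

A mismatch between the computed digest and the pointer's `oid` indicates a truncated or corrupted download.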