ashvardanian committed
Commit ffae8d1
1 Parent(s): ef52b55

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,16 +1,130 @@
  ---
  license: apache-2.0
- language:
- - en
- library_name: UForm
  pipeline_tag: feature-extraction
  tags:
- - clip
- - vision
- - transformers.js
- - transformers
  datasets:
- - sbu_captions
- - visual_genome
- - ChristophSchuhmann/MS_COCO_2017_URL_TEXT
- ---

  ---
  license: apache-2.0
  pipeline_tag: feature-extraction
  tags:
+ - clip
+ - vision
  datasets:
+ - Ziyang/yfcc15m
+ - conceptual_captions
+ ---
+ <h1 align="center">UForm</h1>
+ <h3 align="center">
+ Multi-Modal Inference Library<br/>
+ For Semantic Search Applications<br/>
+ </h3>
+
+ ---
+
+ UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!
+
+ This is the model card of the __English-only model__ with:
+
+ * a 4-layer BERT (2 layers for unimodal encoding, the remaining layers for multimodal encoding)
+ * a ViT-S/16 (image resolution of 224x224)
+
+ If you need a multilingual model, check [this one](https://huggingface.co/unum-cloud/uform-vl-multilingual).
+
+ ## Evaluation
+
+ The following metrics were obtained with multimodal re-ranking (text-to-image retrieval):
+
+ | Dataset           | Recall@1 | Recall@5 | Recall@10 |
+ | :---------------- | -------: | -------: | --------: |
+ | Zero-Shot Flickr  | 0.565    | 0.790    | 0.860     |
+ | Zero-Shot MS-COCO | 0.281    | 0.525    | 0.645     |
+
+ ImageNet-Top1: 0.361 \
+ ImageNet-Top5: 0.608
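+
+ The retrieval numbers above are Recall@K values: the share of text queries whose correct image appears among the K best-scoring candidates. A self-contained sketch of that computation, with random scores standing in for real UForm similarities (this is not the evaluation script behind the table):
+
+ ```python
+ import torch
+
+ # Stand-in (num_texts, num_images) similarity matrix; assume the matching
+ # image for text i sits at index i.
+ similarity = torch.randn(5000, 5000)
+ ranks = similarity.argsort(dim=-1, descending=True)      # best-scoring images first
+ correct = torch.arange(similarity.size(0)).unsqueeze(-1)
+ recall_at_10 = (ranks[:, :10] == correct).any(dim=-1).float().mean()
+ print(f'Recall@10: {recall_at_10.item():.3f}')
+ ```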
+
+ ## Installation
+
+ ```bash
+ pip install uform[torch]
+ ```
+
+ ## Usage
+
+ To load the model:
+
+ ```python
+ import uform
+
+ model, processor = uform.get_model('unum-cloud/uform-vl-english-small')
+ ```
+
+ To encode data:
+
+ ```python
+ from PIL import Image
+
+ text = 'a small red panda in a zoo'
+ image = Image.open('red_panda.jpg')
+
+ image_data = processor.preprocess_image(image)
+ text_data = processor.preprocess_text(text)
+
+ image_features, image_embedding = model.encode_image(image_data, return_features=True)
+ text_features, text_embedding = model.encode_text(text_data, return_features=True)
+ joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
+ ```
+
+ To get features:
+
+ ```python
+ image_features, image_embedding = model.encode_image(image_data, return_features=True)
+ text_features, text_embedding = model.encode_text(text_data, return_features=True)
+ ```
+
+ These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
+
+ ```python
+ joint_embedding = model.encode_multimodal(
+     image_features=image_features,
+     text_features=text_features,
+     attention_mask=text_data['attention_mask']
+ )
+ ```
+
+ There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
+
+ ### Cosine Similarity
+
+ ```python
+ import torch.nn.functional as F
+
+ similarity = F.cosine_similarity(image_embedding, text_embedding)
+ ```
+
+ The `similarity` will fall in the `[-1, 1]` range, where `1` means a perfect match.
+
+ __Pros__:
+
+ - Computationally cheap.
+ - Only unimodal embeddings are required, and unimodal encoding is faster than joint encoding.
+ - Suitable for retrieval in large collections (see the sketch after this list).
+
+ __Cons__:
+
+ - Takes into account only coarse-grained features.
+
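+ For example, ranking a whole collection by cosine similarity reduces to a matrix-vector product once the embeddings are unit-normalized. A minimal sketch with random tensors standing in for real UForm embeddings (the 256-dimensional size matches this model's `embedding_dim`):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Stand-in data: in practice these come from model.encode_image / model.encode_text.
+ image_embeddings = F.normalize(torch.randn(1000, 256), dim=-1)   # a collection of 1000 images
+ text_embedding = F.normalize(torch.randn(1, 256), dim=-1)        # one text query
+
+ # For unit-normalized vectors, cosine similarity is just a dot product.
+ scores = (image_embeddings @ text_embedding.T).squeeze(-1)       # shape (1000,)
+ top10 = scores.topk(10).indices                                  # indices of the 10 best matches
+ ```
+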
+ ### Matching Score
+
+ Unlike cosine similarity, unimodal embeddings are not enough here.
+ A joint embedding is needed, and the resulting `score` falls in the `[0, 1]` range, where `1` means a perfect match.
+
+ ```python
+ score = model.get_matching_scores(joint_embedding)
+ ```
+
+ __Pros__:
+
+ - Joint embedding captures fine-grained features.
+ - Suitable for re-ranking – sorting retrieval results (see the sketch below).
+
+ __Cons__:
+
+ - Resource-intensive.
+ - Not suitable for retrieval in large collections.
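+
+ A rough sketch of the two-stage recipe implied above: retrieve candidates with cosine similarity, then re-rank the top few with matching scores. File names are placeholders, and `model`, `processor`, and `text_data` are the objects created in the usage examples:
+
+ ```python
+ candidate_paths = ['candidate_0.jpg', 'candidate_1.jpg', 'candidate_2.jpg']  # placeholder paths
+
+ scores = []
+ for path in candidate_paths:
+     image_data = processor.preprocess_image(Image.open(path))
+     joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
+     scores.append(float(model.get_matching_scores(joint_embedding)))
+
+ # Highest matching score first.
+ reranked = sorted(zip(candidate_paths, scores), key=lambda pair: pair[1], reverse=True)
+ ```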
config.json CHANGED
@@ -19,14 +19,22 @@
  "dropout_prob": 0.1
  },
  "image_encoder": {
  "dim": 384,
  "patch_size": 16,
  "image_size": 224,
  "num_layers": 12,
  "num_heads": 6,
  "embedding_dim": 256,
- "normalization_means": [0.48145466, 0.4578275, 0.40821073],
- "normalization_deviations": [0.26862954, 0.26130258, 0.27577711],
  "pooling": "cls"
  }
  }

  "dropout_prob": 0.1
  },
  "image_encoder": {
+ "normalization_means": [
+ 0.48145466,
+ 0.4578275,
+ 0.40821073
+ ],
+ "normalization_deviations": [
+ 0.26862954,
+ 0.26130258,
+ 0.27577711
+ ],
  "dim": 384,
  "patch_size": 16,
  "image_size": 224,
  "num_layers": 12,
  "num_heads": 6,
  "embedding_dim": 256,
  "pooling": "cls"
  }
  }
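
The normalization constants above are the standard CLIP image-preprocessing mean and standard deviation. A hypothetical sketch of a pipeline that would consume these config values (the actual preprocessing ships inside the uform processor):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224),                            # "image_size": 224
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],      # "normalization_means"
        std=[0.26862954, 0.26130258, 0.27577711],      # "normalization_deviations"
    ),
])
```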
image_encoder.mlpackage/Data/com.apple.CoreML/model.mlmodel CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f06b44c342f406cb5d3bfa2c009b55bbac455ba97bb07846b6161287aee4c1a3
- size 111195

  version https://git-lfs.github.com/spec/v1
+ oid sha256:f18da9209de45eca55e0e0f23f3f3f7c4ad1a03d96533a7c5108df0a125c1663
+ size 111190
image_encoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:80df8d05daadf940deb3a21f1f7fa4b03dd5e7ee4a2ec0be9cca87a6751ebc31
  size 87106624

  version https://git-lfs.github.com/spec/v1
+ oid sha256:750f259efc991efc9f32aae3cb8960508e8ae68122b7fa83f26b6ba97b78d1cc
  size 87106624
image_encoder.mlpackage/Manifest.json CHANGED
@@ -1,18 +1,18 @@
  {
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
- "20458914-6BBA-4AF3-ADB8-60CBA7CE7713": {
- "author": "com.apple.CoreML",
- "description": "CoreML Model Specification",
- "name": "model.mlmodel",
- "path": "com.apple.CoreML/model.mlmodel"
- },
- "CD25C0E2-C2BB-4C45-8EF7-2D23E3E550A6": {
  "author": "com.apple.CoreML",
  "description": "CoreML Model Weights",
  "name": "weights",
  "path": "com.apple.CoreML/weights"
  }
  },
- "rootModelIdentifier": "20458914-6BBA-4AF3-ADB8-60CBA7CE7713"
  }

  {
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
+ "703D1BA1-CC3F-468B-B079-5914AE99ECF4": {
  "author": "com.apple.CoreML",
  "description": "CoreML Model Weights",
  "name": "weights",
  "path": "com.apple.CoreML/weights"
+ },
+ "C8AB2870-74DF-4EDF-9F73-4903BE34700E": {
+ "author": "com.apple.CoreML",
+ "description": "CoreML Model Specification",
+ "name": "model.mlmodel",
+ "path": "com.apple.CoreML/model.mlmodel"
  }
  },
+ "rootModelIdentifier": "C8AB2870-74DF-4EDF-9F73-4903BE34700E"
  }
image_encoder.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0eb66a62c8edb29ee6791882d7153501327d8450798a75685a835c051eadd54b
- size 22589727

  version https://git-lfs.github.com/spec/v1
+ oid sha256:f0d6bc2354318ad9d5a53cad2aa219e307bcdcf3010aa9a8cdb3f385c574015a
+ size 22589738
image_encoder.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b194aad10463422dcce1605116b505301cd5cc24382b59869ac59090dcd8de42
- size 43604334

  version https://git-lfs.github.com/spec/v1
+ oid sha256:ad037e636f42ca84f3e2c2a3c1180649f46507ff5c771f6d7def14481f84dbb6
+ size 43620486
text_encoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d022e8345991dbf0352357afd63df088628115e22066e864506f9b71de5c480f
  size 151459264

  version https://git-lfs.github.com/spec/v1
+ oid sha256:091bd543cf3c3ec3eddd4b0d50fca6309b47e15163c69c8e4a380c708ed9a320
  size 151459264
text_encoder.mlpackage/Manifest.json CHANGED
@@ -1,18 +1,18 @@
  {
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
- "74D647A6-216E-48BE-A0A6-BD4BF4EEB3CA": {
- "author": "com.apple.CoreML",
- "description": "CoreML Model Weights",
- "name": "weights",
- "path": "com.apple.CoreML/weights"
- },
- "BE4077DB-78CC-4549-B7F9-0F6AA0AEFF56": {
  "author": "com.apple.CoreML",
  "description": "CoreML Model Specification",
  "name": "model.mlmodel",
  "path": "com.apple.CoreML/model.mlmodel"
  }
  },
- "rootModelIdentifier": "BE4077DB-78CC-4549-B7F9-0F6AA0AEFF56"
  }

  {
  "fileFormatVersion": "1.0.0",
  "itemInfoEntries": {
+ "702D723F-AA56-4B0B-A29C-6F7D643170BA": {
  "author": "com.apple.CoreML",
  "description": "CoreML Model Specification",
  "name": "model.mlmodel",
  "path": "com.apple.CoreML/model.mlmodel"
+ },
+ "D5AF71DD-F123-410D-9CED-F0017B203BF1": {
+ "author": "com.apple.CoreML",
+ "description": "CoreML Model Weights",
+ "name": "weights",
+ "path": "com.apple.CoreML/weights"
  }
  },
+ "rootModelIdentifier": "702D723F-AA56-4B0B-A29C-6F7D643170BA"
  }
text_encoder.onnx CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c74e2c8fdbc145780e2473b35553ad0fef523ac7a7ac17d4920532179fdbcf70
  size 37994692

  version https://git-lfs.github.com/spec/v1
+ oid sha256:c0b8c0dca234622eb2ee57acf22c893394a79126b7dc4be2ed7cb7dcb1095aca
  size 37994692
text_encoder.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e1d667c4601292fa2197face6fa472726ce69165e9674dc7f3f7ca66b35d82e5
- size 114152802

  version https://git-lfs.github.com/spec/v1
+ oid sha256:d8636c1447e28b39b9d1d51902497acd65007aad26b3cf84912fd917046bccb7
+ size 114159394