neggles committed
Commit 885a42a · 1 Parent(s): 64e99b1

does this work?

Files changed (6)
  1. .gitignore +276 -0
  2. .vscode/settings.json +58 -0
  3. README.md +3 -3
  4. app.css +27 -0
  5. app.py +197 -0
  6. requirements.txt +5 -0
.gitignore ADDED
@@ -0,0 +1,276 @@
+# Created by https://www.toptal.com/developers/gitignore/api/linux,macos,windows,visualstudiocode,python
+# Edit at https://www.toptal.com/developers/gitignore?templates=linux,macos,windows,visualstudiocode,python
+
+### Linux ###
+*~
+
+# temporary files which can be created if a process still has a handle open of a deleted file
+.fuse_hidden*
+
+# KDE directory preferences
+.directory
+
+# Linux trash folder which might appear on any partition or disk
+.Trash-*
+
+# .nfs files are created when an open file is removed but is still being accessed
+.nfs*
+
+### macOS ###
+# General
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+### macOS Patch ###
+# iCloud generated files
+*.icloud
+
+### Python ###
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+### Python Patch ###
+# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
+poetry.toml
+
+# ruff
+.ruff_cache/
+
+# LSP config files
+pyrightconfig.json
+
+### VisualStudioCode ###
+.vscode/*
+!.vscode/settings.json
+!.vscode/tasks.json
+!.vscode/launch.json
+!.vscode/extensions.json
+!.vscode/*.code-snippets
+
+# Local History for Visual Studio Code
+.history/
+
+# Built Visual Studio Code Extensions
+*.vsix
+
+### VisualStudioCode Patch ###
+# Ignore all local history of files
+.history
+.ionide
+
+### Windows ###
+# Windows thumbnail cache files
+Thumbs.db
+Thumbs.db:encryptable
+ehthumbs.db
+ehthumbs_vista.db
+
+# Dump file
+*.stackdump
+
+# Folder config file
+[Dd]esktop.ini
+
+# Recycle Bin used on file shares
+$RECYCLE.BIN/
+
+# Windows Installer files
+*.cab
+*.msi
+*.msix
+*.msm
+*.msp
+
+# Windows shortcuts
+*.lnk
+
+# End of https://www.toptal.com/developers/gitignore/api/linux,macos,windows,visualstudiocode,python
+
+# app dir
+/app/
+
+# misc and temp for dev reasons
+/misc/
+/temp/
.vscode/settings.json ADDED
@@ -0,0 +1,58 @@
+{
+  "editor.insertSpaces": true,
+  "editor.tabSize": 4,
+  "files.trimTrailingWhitespace": true,
+  "editor.rulers": [ 100, 120 ],
+
+  "files.associations": {
+    "*.yaml": "yaml",
+    ".gitignore": "gitignore",
+    ".prettierrc": "json",
+    "requirements*.txt": "pip-requirements"
+  },
+
+  "[python]": {
+    "editor.wordBasedSuggestions": false,
+    "editor.formatOnSave": true,
+    "editor.defaultFormatter": "ms-python.black-formatter",
+    "editor.codeActionsOnSave": {
+      "source.organizeImports": true
+    }
+  },
+  "python.analysis.extraPaths": [
+    "../..",
+    "../../repositories/k-diffusion",
+    "../../repositories/generative-models",
+    "../../repositories/stable-diffusion-stability-ai"
+  ],
+  "black-formatter.args": [ "--line-length=110" ],
+
+  "[json]": {
+    "editor.tabSize": 2,
+    "editor.detectIndentation": false,
+    "editor.formatOnSave": true,
+    "editor.formatOnSaveMode": "file",
+    "editor.defaultFormatter": "vscode.json-language-features"
+  },
+  "[jsonc]": {
+    "editor.tabSize": 2,
+    "editor.detectIndentation": false,
+    "editor.formatOnSave": true,
+    "editor.formatOnSaveMode": "file",
+    "editor.defaultFormatter": "vscode.json-language-features"
+  },
+
+  "[yaml]": {
+    "editor.detectIndentation": false,
+    "editor.tabSize": 2,
+    "editor.formatOnSave": true,
+    "editor.formatOnSaveMode": "file"
+  },
+  "yaml.format.bracketSpacing": true,
+  "yaml.format.proseWrap": "preserve",
+  "yaml.format.singleQuote": false,
+  "yaml.format.printWidth": 110,
+
+  "remote.autoForwardPorts": false,
+  "remote.autoForwardPortsSource": "process"
+}
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
 title: Clip Tokenizer Util
 emoji: πŸ†
-colorFrom: yellow
-colorTo: pink
+colorFrom: purple
+colorTo: yellow
 sdk: gradio
 sdk_version: 3.50.2
 app_file: app.py
@@ -10,4 +10,4 @@ pinned: false
 license: bsd-3-clause
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+a gradio space for playing with CLIP's tokenizer
app.css ADDED
@@ -0,0 +1,27 @@
+.tokenizer-token {
+    cursor: pointer;
+}
+.tokenizer-token-0 {
+    background: rgba(255, 0, 0, 0.05);
+}
+.tokenizer-token-0:hover {
+    background: rgba(255, 0, 0, 0.15);
+}
+.tokenizer-token-1 {
+    background: rgba(0, 255, 0, 0.05);
+}
+.tokenizer-token-1:hover {
+    background: rgba(0, 255, 0, 0.15);
+}
+.tokenizer-token-2 {
+    background: rgba(0, 0, 255, 0.05);
+}
+.tokenizer-token-2:hover {
+    background: rgba(0, 0, 255, 0.15);
+}
+.tokenizer-token-3 {
+    background: rgba(255, 156, 0, 0.05);
+}
+.tokenizer-token-3:hover {
+    background: rgba(255, 156, 0, 0.15);
+}
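
The four `tokenizer-token-N` rules above look intended to be cycled across consecutive tokens so that neighbouring tokens get different tints. A minimal sketch of that cycling follows. The helper name `render_tokens` and the sample ids are hypothetical and not part of this commit:

```python
import html


def render_tokens(pieces):
    """Wrap each (ids, text) pair in a span whose class cycles through 0..3.

    `render_tokens` is a hypothetical helper used only for illustration.
    """
    spans = []
    for index, (ids, text) in enumerate(pieces):
        # The title attribute carries the token ids, shown as a tooltip on hover.
        title = html.escape(", ".join(str(i) for i in ids))
        spans.append(
            f"<span class='tokenizer-token tokenizer-token-{index % 4}' "
            f"title='{title}'>{html.escape(text)}</span>"
        )
    return "".join(spans)


# Made-up ids and text; the fifth span wraps back around to class 0.
markup = render_tokens([([1], "a"), ([2], "b"), ([3], "c"), ([4], "d"), ([5], "<e>")])
```

Escaping both the tooltip and the visible text keeps token strings such as `<|startoftext|>` from being interpreted as markup.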
app.py ADDED
@@ -0,0 +1,197 @@
+import html
+import logging
+from pathlib import Path
+
+import gradio as gr
+from gradio.themes.utils import colors
+from transformers import CLIPTokenizer
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+gr_logger = logging.getLogger("gradio")
+gr_logger.setLevel(logging.INFO)
+
+
+class ClipUtil:
+    def __init__(self):
+        logger.info("Loading ClipUtil")
+
+        self.theme = gr.themes.Base(
+            primary_hue=colors.violet,
+            secondary_hue=colors.indigo,
+            neutral_hue=colors.slate,
+            font=[gr.themes.GoogleFont("Fira Sans"), "ui-sans-serif", "system-ui", "sans-serif"],
+            font_mono=[gr.themes.GoogleFont("Fira Code"), "ui-monospace", "Consolas", "monospace"],
+        ).set(
+            slider_color_dark="*primary_500",
+        )
+
+        try:
+            self.css = Path(__file__).with_suffix(".css").read_text()
+        except Exception:
+            logger.exception("Failed to load CSS file")
+            self.css = ""
+
+        self.tokenizer: CLIPTokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
+        self.vocab = {v: k for k, v in self.tokenizer.get_vocab().items()}
+
+        self.blocks = gr.Blocks(
+            title="ClipTokenizerUtil", analytics_enabled=False, theme=self.theme, css=self.css
+        )
+
+    def tokenize(self, text: str, input_ids: bool = False):
+        if input_ids:
+            tokens = [int(x.strip()) for x in text.split(",")]
+        else:
+            tokens = self.tokenizer(text, return_tensors="np").input_ids.squeeze().tolist()
+
+        code = ""
+        ids = []
+        current_ids = []
+        class_index = 0
+
+        byte_decoder = self.tokenizer.byte_decoder
+
+        def dump(last=False):
+            nonlocal code, ids, current_ids
+            words = [self.vocab.get(x, "") for x in current_ids]
+
+            def wordscode(ids, word):
+                nonlocal class_index
+                word_title = html.escape(", ".join([str(x) for x in ids]))
+                res = f"""
+                <span class='tokenizer-token tokenizer-token-{class_index % 4}' title='{word_title}'>
+                    {html.escape(word)}
+                </span>
+                """
+                class_index += 1
+                return res
+
+            try:
+                word = bytearray([byte_decoder[x] for x in "".join(words)]).decode("utf-8")
+            except UnicodeDecodeError:
+                if last:
+                    word = "❌" * len(current_ids)
+                elif len(current_ids) > 4:
+                    id = current_ids[0]
+                    ids += [id]
+                    local_ids = current_ids[1:]
+                    code += wordscode([id], "❌")
+
+                    current_ids = []
+                    for id in local_ids:
+                        current_ids.append(id)
+                        dump()
+                    return
+                else:
+                    return
+
+            # word = word.replace("</w>", " ")
+
+            code += wordscode(current_ids, word)
+            ids += current_ids
+
+            current_ids = []
+
+        for token in tokens:
+            token = int(token)
+            current_ids.append(token)
+
+            dump()
+
+        dump(last=True)
+
+        ids_html = f"""
+        <p>
+        Token count: {len(ids)}
+        <br>
+        {", ".join([str(x) for x in ids])}
+        </p>"""
+
+        return code, ids_html
+
+    def create_components(self):
+        with self.blocks:
+            # title bar
+            with gr.Row().style(equal_height=True):
+                with gr.Column(scale=12, elem_id="header_col"):
+                    self.header_title = gr.Markdown(
+                        "## CLIP Tokenizer Util",
+                        elem_id="header_title",
+                    )
+                with gr.Column(scale=1, min_width=90, elem_id="button_col"):
+                    with gr.Row(elem_id="button_row"):
+                        self.reload_btn = gr.Button(
+                            label="refresh",
+                            elem_id="refresh_btn",
+                            type="button",
+                            value="🔄",
+                            variant="primary",
+                        )
+
+            with gr.Tabs() as in_tabs:
+                with gr.Tab(label="Text Input", id="text_input_tab"):
+                    with gr.Row().style(equal_height=True):
+                        with gr.Column(scale=12, elem_id="text_input_col"):
+                            self.text_input = gr.Textbox(
+                                label="Text Input",
+                                elem_id="tokenizer_prompt",
+                                show_label=False,
+                                lines=8,
+                                placeholder="Prompt for tokenization",
+                            )
+                    self.text_button = gr.Button(
+                        label="Tokenize",
+                        elem_id="go_button",
+                        value="Go",
+                        variant="primary",
+                    )
+
+                with gr.Tab(label="Token Input", id="token_input_tab"):
+                    with gr.Row().style(equal_height=True):
+                        with gr.Column(scale=12, elem_id="text_input_col"):
+                            self.token_input = gr.Textbox(
+                                lines=5,
+                                label="Text Input",
+                                elem_id="text_input",
+                                placeholder="Enter text here",
+                            )
+                    self.token_button = gr.Button(
+                        label="Tokenize",
+                        elem_id="go_button",
+                        type="button",
+                        value="Go",
+                        variant="primary",
+                    )
+
+            with gr.Tabs():
+                with gr.Tab("Text"):
+                    tokenized_text = gr.HTML(elem_id="tokenized_text")
+                with gr.Tab("Tokens"):
+                    tokenized_ids = gr.HTML(elem_id="tokenized_ids")
+
+            self.text_button.click(
+                fn=self.tokenize,
+                inputs=[self.text_input],
+                outputs=[tokenized_text, tokenized_ids],
+            )
+            self.token_button.click(
+                # gradio 3.x click() takes no `kwargs` argument, so bind input_ids here instead
+                fn=lambda text: self.tokenize(text, input_ids=True),
+                inputs=[self.token_input],
+                outputs=[tokenized_text, tokenized_ids],
+            )
+
+    def launch(self, **kwargs) -> None:
+        return self.blocks.launch(
+            server_name="0.0.0.0",
+            show_error=True,
+            enable_queue=True,
+            **kwargs,
+        )
+
+
+if __name__ == "__main__":
+    clip_util = ClipUtil()
+    clip_util.create_components()
+    clip_util.launch()
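
The `dump` helper in `tokenize` buffers token ids until their combined bytes decode as UTF-8, because a single multi-byte character can be split across several tokens. A minimal, standalone sketch of that decode step is below. It assumes the tokenizer's `byte_decoder` follows the GPT-2-style byte-to-character table, which is rebuilt here by hand. The names `bytes_to_unicode` and `try_decode` and the sample token strings are illustrative, not taken from this commit:

```python
def bytes_to_unicode():
    """Build a byte -> printable-character table in the GPT-2 style (assumed layout)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: vocabulary character -> original byte value.
byte_decoder = {char: byte for byte, char in bytes_to_unicode().items()}


def try_decode(token_strings):
    """Return the decoded text, or None if the bytes are not yet valid UTF-8."""
    raw = bytearray(byte_decoder[ch] for ch in "".join(token_strings))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


partial = try_decode(["Ã"])        # only the first byte (0xC3) of a two-byte sequence
complete = try_decode(["Ã", "©"])  # 0xC3 0xA9 together form one character
```

Here `partial` is `None` while `complete` decodes to `é`, which is the situation where `dump` returns early and waits for more ids before emitting a span.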
requirements.txt ADDED
@@ -0,0 +1,5 @@
+transformers
+tokenizers
+torch
+open-clip-torch==2.20.0
+gradio==3.50.2