Commit 085ad06
1 Parent(s): ab03e32

add BPE, Unigram, and WordPiece tokenisers

Files changed (2)
  1. .streamlit/config.toml +1 -1
  2. app.py +279 -82
.streamlit/config.toml CHANGED
@@ -1,3 +1,3 @@
  [theme]
- base="dark"
+ base="light"
  font="sans serif"
app.py CHANGED
@@ -39,13 +39,16 @@ advancements:

Understanding these foundational concepts is crucial to comprehending the overall structure and function of the
Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles
- within the architecture are essential to the model's ability to process and generate language.

Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series
solely to Tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a
desire to provide a thorough and in-depth understanding of each component of the Transformer model.

-
""")

with st.expander("Copernicus Museum in Warsaw"):
@@ -56,7 +59,7 @@ providing the opportunity to explore different concepts firsthand. **They even f
illustrates the neuron activation process during the recognition of handwritten digits!**

Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first
- establishing a firm understanding of Tokenisation and embeddings. This foundation will equip us with the knowledge
needed to delve into the more complex aspects of these models later on.

I encourage you not to hesitate in modifying parameters or experimenting with different models in the provided
@@ -67,6 +70,7 @@ through this virtual, interactive museum of AI. Enjoy the exploration!

divider()

st.header("Tokenisers and Tokenisation")

st.write("""Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP)
@@ -78,40 +82,36 @@ the language of the text.

from transformers import AutoTokenizer

- sentence = st.text_input("Sentence to explore (you can change it):", value="Tokenising text is a fundamental step for NLP models.")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
sentence_encode_bert = tokenizer.encode(sentence)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

- st.write(f"""
- Consider the sentence:
- """)
- st.code(f"""
- "{sentence}"
- """)
-
- st.write(f"""
- A basic word-level Tokenisation would produce tokens:
""")
st.code(f"""
{sentence_split}
""")


- st.write(f"""
- However, a more sophisticated algorithm, with several optimizations, might generate a different set of tokens:
- """)
st.code(f"""
{sentence_tokenise_bert}
""")

- with st.expander("click to look at the code:"):
st.code(f"""\
from transformers import AutoTokenizer

- sentence = st.text_input("Sentence to explore (you can change it):", value="{sentence}")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
@@ -144,38 +144,91 @@ Changing the Tokeniser is akin to altering the model's language on the fly—ima
then taking the exam in French or Spanish. You might get lucky, but it's a considerable risk.
""")

- with st.expander("""Let's train a tokeniser using our own dataset"""):
- training_dataset = """\
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.
- Sparse is better than dense.
- Readability counts.
- """
training_dataset = st.text_area("*Training Dataset - Vocabulary:*", value=training_dataset, height=200)
training_dataset = training_dataset.split('\n')
vocabulary_size = st.number_input("Vocabulary Size:", value=100000)


- # TODO: add more tokenisers
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
- # tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocabulary_size)
-
- # trainer = trainers.UnigramTrainer(
- #     vocab_size=20000,
- #     initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
- #     special_tokens=["<PAD>", "<BOS>", "<EOS>"],
- # )
-
tokenizer.train_from_iterator(training_dataset, trainer=trainer)
-
- sentence = st.text_input("*Text to tokenise:*", value="[CLS] Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")
output = tokenizer.encode(sentence)

st.write("*Tokens:*")
@@ -185,60 +238,204 @@ Readability counts.
attention_mast: {output.attention_mask}
""")


st.subheader("Try Yourself:")
- st.write(f""" *Aim to find or create a comprehensive vocabulary (training dataset) for Tokenisation, which can enhance the
efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence
- more understandable and containing less tokens*
""")

- st.caption("Special tokens meaning:")
- st.write("""
- \\#\\# prefix: It means that the preceding string is not whitespace, any token with this prefix should be
- merged with the previous token when you convert the tokens back to a string.
-
- [UNK]: Stands for "unknown". This token is used to represent any word that is not in the model's vocabulary. Since
- most models have a fixed-size vocabulary, it's not possible to have a unique token for every possible word. The [UNK]
- token is used as a catch-all for any words the model hasn't seen before. E.g. in our example we 'decided' that Large
- Language (LL) abbreviation is not part of the model's vocabulary.
-
- [CLS]: Stands for "classification". In models like BERT, this token is added at the beginning of every input
- sequence. The representation (embedding) of this token is used as the aggregate sequence representation for
- classification tasks. In other words, the model is trained to encode the meaning of the entire sequence into this token.
-
- [SEP]: Stands for "separator". This token is used to separate different sequences when the model needs to take more
- than one input sequence. For example, in question-answering tasks, the model takes two inputs: a question and a
- passage that contains the answer. The two inputs are separated by a [SEP] token.
-
- [MASK]: This token is specific to models like BERT, which are trained with a masked language modelling objective.
- During training, some percentage of the input tokens are replaced with the [MASK] token, and the model's goal is to
- predict the original value of the masked tokens.
-
- [PAD]: Stands for "padding". This token is used to fill in the extra spaces when batching sequences of different
- lengths together. Since models require input sequences to be the same length, shorter sequences are extended with [
- PAD] tokens. In our example, we extended the length of the input sequence to 16 tokens.

- """)
- st.caption("Python code:")
- st.code(f"""
- from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
- tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
- tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
- tokenizer.decoder = decoders.ByteLevel()
- trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size={vocabulary_size})
- training_dataset = {training_dataset}
- tokenizer.train_from_iterator(training_dataset, trainer=trainer)
- output = tokenizer.encode("{sentence}")
- """, language='python')


with st.expander("References:"):
st.write("""\
- - https://huggingface.co/docs/transformers/tokenizer_summary
- - https://huggingface.co/docs/tokenizers/training_from_memory
- - https://en.wikipedia.org/wiki/Byte_pair_encoding
-
""")

divider()

app.py (after changes):

Understanding these foundational concepts is crucial to comprehending the overall structure and function of the
Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles
+ within the architecture are essential to the model's ability to process and generate language. In my view,
+ a comprehensive yet simple explanation can give a reader a significant advantage in using LLMs. Feynman once said,
+ "*I think I can safely say that nobody understands quantum mechanics*", precisely because it could not be explained at a freshman level.

Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series
solely to Tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a
desire to provide a thorough and in-depth understanding of each component of the Transformer model.

+ Note: *HuggingFace provides an exceptional [tutorial on Transformer models](https://huggingface.co/docs/transformers/index).
+ That tutorial is particularly beneficial for readers willing to dive into advanced topics.*
""")

with st.expander("Copernicus Museum in Warsaw"):
 
illustrates the neuron activation process during the recognition of handwritten digits!**

Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first
+ establishing a firm understanding of tokenisation and embeddings. This foundation will equip us with the knowledge
needed to delve into the more complex aspects of these models later on.

I encourage you not to hesitate in modifying parameters or experimenting with different models in the provided


divider()

+
st.header("Tokenisers and Tokenisation")

st.write("""Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP)
 

from transformers import AutoTokenizer

+ sentence = st.text_input("Consider the sentence (you can change it):", value="Tokenising text is a fundamental step for NLP models.")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
sentence_encode_bert = tokenizer.encode(sentence)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

+ st.write(f"""\
+ A basic word-level tokenisation, which splits the text by spaces, would produce the following tokens:
""")
st.code(f"""
{sentence_split}
""")


+ st.write(f"""\
+ However, we notice that punctuation may stay attached to the words. It is also unfortunate how this simple tokenisation deals with a word like "Don't".
+ "Don't" stands for "do not", so it would be better tokenised as ["Do", "n't"]. (Hint: try another sentence: "I mustn't tell lies. Don't do this.") This is where things start getting complicated,
+ and it is part of the reason each model has its own tokeniser type. Depending on the rules we apply for tokenising a text,
+ a different tokenised output is generated for the same text.
+ A more sophisticated algorithm, with several optimisations, might generate a different set of tokens: """)
st.code(f"""
{sentence_tokenise_bert}
""")

+ with st.expander("click here to look at the Python code:"):
st.code(f"""\
from transformers import AutoTokenizer

+ sentence = "{sentence}"
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
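As a quick aside on the contraction example above, the following self-contained sketch (not part of app.py; it assumes the transformers package and the pretrained bert-base-uncased tokeniser are available) contrasts a naive whitespace split with the BERT tokeniser on a sentence that actually contains "Don't":

# Illustrative sketch: naive split vs. BERT subword tokenisation (not part of app.py).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "I mustn't tell lies. Don't do this."

print(sentence.split())              # whitespace split keeps punctuation attached, e.g. "Don't"
print(tokenizer.tokenize(sentence))  # subword tokens; contractions are split around the apostrophe
print(tokenizer.encode(sentence))    # token ids, with [CLS]/[SEP] added automatically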
 
then taking the exam in French or Spanish. You might get lucky, but it's a considerable risk.
""")

+ training_dataset = """\
+ Beautiful is better than ugly.
+ Explicit is better than implicit.
+ Simple is better than complex.
+ Complex is better than complicated.
+ Flat is better than nested.
+ Sparse is better than dense.
+ Readability counts.
+ """
+
+ tokeniser_name = st.selectbox(label="Choose your tokeniser", options=["BPE", 'Unigram', 'WordPiece'])
+ if tokeniser_name == 'BPE':
+ st.subheader("Byte-Pair Encoding (BPE)")
+ st.write("""\
+ Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword
+ Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the
+ training data into words. Pre-tokenization can be as simple as space tokenization, e.g. GPT-2, RoBERTa. More
+ advanced pre-tokenization includes rule-based tokenization, e.g. XLM, FlauBERT which uses Moses for most
+ languages, or GPT which uses Spacy and ftfy to count the frequency of each word in the training corpus.
+
+ After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
+ training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the
+ set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so
+ until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a
+ hyperparameter to define before training the tokenizer.
+
+ As an example, let’s assume that after pre-tokenization, the following set of words including their frequency has
+ been determined:
+ """)
+ st.code(""" ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) """)
+ st.write("""\
+ Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:
+ """)
+ st.code(""" ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) """)
+ st.write("""\
+ BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs
+ most frequently. In the example above "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10
+ occurrences of "hug", 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is "u"
+ followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to
+ group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words
+ then becomes
+ """)
+ st.code(""" ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) """)
+ st.write("""\
+ BPE then identifies the next most common symbol pair. It’s "u" followed by "n", which occurs 16
+ times. "u", "n" is merged to "un" and added to the vocabulary. The next most frequent symbol pair is "h" followed
+ by "ug", occurring 15 times. Again the pair is merged and "hug" can be added to the vocabulary.
+
+ At this stage, the vocabulary is ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] and our set of unique words is represented as
+ """)
+ st.code(""" ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) """)
+ st.write("""\
+ Assuming that the Byte-Pair Encoding training stops at this point, the learned merge rules
+ are then applied to new words (as long as those new words do not include symbols that were not in the base
+ vocabulary). For instance, the word "bug" would be tokenized as ["b", "ug"] but "mug" would be tokenized as [
+ "[unk]", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not
+ replaced by the "[unk]" symbol because the training data usually includes at least one occurrence of each letter,
+ but it is likely to happen for very special characters like emojis.
+
+ As mentioned earlier, the vocabulary size, i.e. the base vocabulary size + the number of merges, is a hyperparameter
+ to choose. For instance GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop
+ training after 40,000 merges.
+ """)
+
+
+ st.subheader("Try Yourself:")
+ st.write(f"""\
+ *Using the text area field below, try to find or create a comprehensive vocabulary (training dataset) for tokenisation, which can enhance the
+ efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence
+ more understandable and containing fewer tokens (ids).*
+ """)
+
training_dataset = st.text_area("*Training Dataset - Vocabulary:*", value=training_dataset, height=200)
training_dataset = training_dataset.split('\n')
vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
+ sentence = st.text_input(label="*Text to tokenise:*",
+ value="[CLS] Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")


from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocabulary_size)
tokenizer.train_from_iterator(training_dataset, trainer=trainer)
output = tokenizer.encode(sentence)

st.write("*Tokens:*")
attention_mask: {output.attention_mask}
""")

+ st.write(""" *Well done if you get ids like these: [1, 57, 49, 28, 10, 58, 55, 52, 31, 54, 5, 2, 3, 3, 3]!*""")
+
+ with st.expander("Python code:"):
+ st.code(f"""
+ from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
+ tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+ tokenizer.decoder = decoders.ByteLevel()
+ trainer = trainers.BpeTrainer(
+ special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+ vocab_size={vocabulary_size})
+ training_dataset = {training_dataset}
+ tokenizer.train_from_iterator(training_dataset, trainer=trainer)
+ output = tokenizer.encode("{sentence}")
+ """, language='python')
+ elif tokeniser_name == 'Unigram':
+ st.subheader("""Unigram""")
+ st.write("""\
+ Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural
+ Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf).
+ In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and
+ progressively trims down each symbol to obtain a smaller vocabulary. The base vocabulary could for instance
+ correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly for any of the
+ models in the transformers library, but it’s used in conjunction with SentencePiece.
+
+ At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
+ data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary,
+ the algorithm computes how much the overall loss would increase if the symbol was to be removed from the vocabulary.
+ Unigram then removes p percent (with p usually being 10% or 20%) of the symbols whose loss increase is the lowest,
+ i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the
+ vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can
+ be tokenized.
+
+ Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
+ tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:
+ """)
+ st.code(""" ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] """)
+ st.write("""\
+ "hugs" could be tokenized as ["hug", "s"], ["h", "ug", "s"], or ["h", "u", "g", "s"]. So which
+ one to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary,
+ so that the probability of each possible tokenization can be computed after training. In practice, the algorithm simply picks
+ the most likely tokenization, but it also offers the possibility to sample one of the possible tokenizations
+ according to their probabilities.
+ """)
+
+ st.subheader("Try Yourself:")
+ st.write(f"""\
+ *Using the text area field below, try to find or create a comprehensive vocabulary (training dataset) for tokenisation, which can enhance the
+ efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence
+ more understandable and containing fewer tokens (ids).*
+ """)
+
+ training_dataset = st.text_area("*Training Dataset - Vocabulary (change it and look at the resulting tokens):*", value=training_dataset, height=200)
+ training_dataset = training_dataset.split('\n')
+ vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
+ sentence = st.text_input(label="*Text to tokenise:*",
+ value="[CLS] Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")
+
+ from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.Unigram())
+ tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+ tokenizer.decoder = decoders.ByteLevel()
+ trainer = trainers.UnigramTrainer(
+ vocab_size=vocabulary_size,
+ unk_token="[UNK]",
+ # initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
+ special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+ )
+ tokenizer.train_from_iterator(training_dataset, trainer=trainer)
+ output = tokenizer.encode(sentence)
+
+ # TODO: make it more visible, container with a different color or something
+ st.write("*Tokens:*")
+ st.code(f"""{output.tokens}""")
+ st.code(f"""\
+ ids: {output.ids}
+ attention_mask: {output.attention_mask}
+ """)
+
+ st.write(""" *Well done if you get ids like these: [1, 57, 49, 28, 10, 58, 55, 52, 31, 54, 5, 2, 3, 3, 3]!*""")
+ with st.expander("Python code:"):
+ st.code(f"""\
+ from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.Unigram())
+ tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+ tokenizer.decoder = decoders.ByteLevel()
+ trainer = trainers.UnigramTrainer(
+ vocab_size={vocabulary_size},
+ special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+ )
+ training_dataset = {training_dataset}
+ tokenizer.train_from_iterator(training_dataset, trainer=trainer)
+ output = tokenizer.encode("{sentence}")
+ """, language='python')
+ elif tokeniser_name == 'WordPiece':
+ st.subheader("""WordPiece""")
+ st.write("""\
+ WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The
+ algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al.,
+ 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very
+ similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training
+ data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the
+ most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the
+ vocabulary.
+
+ So what does this mean exactly? Referring to the example from the BPE tokeniser above, maximizing the likelihood of the training data is
+ equivalent to finding the symbol pair whose probability divided by the probabilities of its first symbol followed by
+ its second symbol is the greatest among all symbol pairs. E.g. "u" followed by "g" would only have been merged if
+ the probability of "ug" divided by "u", "g" would have been greater than for any other symbol pair. Intuitively,
+ WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth
+ it.
+ """)

st.subheader("Try Yourself:")
+ st.write(f"""\
+ *Using the text area field below, try to find or create a comprehensive vocabulary (training dataset) for tokenisation, which can enhance the
efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence
+ more understandable and containing fewer tokens (ids).*
""")

+ training_dataset = st.text_area("*Training Dataset - Vocabulary (change it and look at the resulting tokens):*",
+ value=training_dataset, height=200)
+ training_dataset = training_dataset.split('\n')
+ vocabulary_size = st.number_input("Vocabulary Size:", value=100000)
+ sentence = st.text_input(label="*Text to tokenise:*",
+ value="[CLS] Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")

+ from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
+ tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+ tokenizer.decoder = decoders.ByteLevel()
+ trainer = trainers.WordPieceTrainer(
+ vocab_size=vocabulary_size,
+ special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+ )
+ tokenizer.train_from_iterator(training_dataset, trainer=trainer)
+ output = tokenizer.encode(sentence)
+
+ # TODO: make it more visible, container with a different color or something
+ st.write("*Tokens:*")
+ st.code(f"""{output.tokens}""")
+ st.code(f"""\
+ ids: {output.ids}
+ attention_mask: {output.attention_mask}
+ """)
+
+ st.write(""" *Well done if you get ids like these: [1, 76, 72, 50, 10, 77, 71, 68, 66, 78, 5, 2, 3, 3, 3]!*""")
+ with st.expander("Python code:"):
+ st.code(f"""\
+ from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
+ trainer = trainers.WordPieceTrainer(
+ vocab_size={vocabulary_size},
+ special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+ )
+ training_dataset = {training_dataset}
+ tokenizer.train_from_iterator(training_dataset, trainer=trainer)
+ output = tokenizer.encode("{sentence}")
+ """, language='python')
+
+
+ with st.expander("Special tokens meaning:"):
+ st.write("""\
+ \\#\\# prefix: It means that the token is not preceded by whitespace; any token with this prefix should be
+ merged with the previous token when you convert the tokens back to a string.
+
+ [UNK]: Stands for "unknown". This token is used to represent any word that is not in the model's vocabulary. Since
+ most models have a fixed-size vocabulary, it's not possible to have a unique token for every possible word. The [UNK]
+ token is used as a catch-all for any words the model hasn't seen before. E.g. in our example we 'decided' that the Large
+ Language (LL) abbreviation is not part of the model's vocabulary.
+
+ [CLS]: Stands for "classification". In models like BERT, this token is added at the beginning of every input
+ sequence. The representation (embedding) of this token is used as the aggregate sequence representation for
+ classification tasks. In other words, the model is trained to encode the meaning of the entire sequence into this token.
+
+ [SEP]: Stands for "separator". This token is used to separate different sequences when the model needs to take more
+ than one input sequence. For example, in question-answering tasks, the model takes two inputs: a question and a
+ passage that contains the answer. The two inputs are separated by a [SEP] token.
+
+ [MASK]: This token is specific to models like BERT, which are trained with a masked language modelling objective.
+ During training, some percentage of the input tokens are replaced with the [MASK] token, and the model's goal is to
+ predict the original value of the masked tokens.
+
+ [PAD]: Stands for "padding". This token is used to fill in the extra spaces when batching sequences of different
+ lengths together. Since models require input sequences to be the same length, shorter sequences are extended with
+ [PAD] tokens. In our example, we extended the length of the input sequence to 16 tokens.
+ """)


with st.expander("References:"):
st.write("""\
+ - https://huggingface.co/docs/transformers/tokenizer_summary
+ - https://huggingface.co/docs/tokenizers/training_from_memory
""")

divider()