googcheng committed on
Commit
2770ddc
1 Parent(s): 994cc5a

Upload 2 files

Files changed (2)
  1. ali_token.ipynb +1584 -0
  2. tiktoken_test.ipynb +0 -0
ali_token.ipynb ADDED
@@ -0,0 +1,1584 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 76,
6
+ "id": "d5c3dff6-bd21-4e6a-8f8b-a83dc6895e08",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "from transformers import AutoTokenizer\n",
11
+ "tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "code",
16
+ "execution_count": 36,
17
+ "id": "549ca852-199a-4d55-9e29-17837dd1f975",
18
+ "metadata": {},
19
+ "outputs": [
20
+ {
21
+ "data": {
22
+ "text/plain": [
23
+ "QWenTokenizer(name_or_path='Qwen/Qwen-7B', vocab_size=151851, model_max_length=8192, is_fast=False, padding_side='right', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=True)"
24
+ ]
25
+ },
26
+ "execution_count": 36,
27
+ "metadata": {},
28
+ "output_type": "execute_result"
29
+ }
30
+ ],
31
+ "source": [
32
+ "tokenizer"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": 37,
38
+ "id": "fe6c9cf7-fbc6-4073-81e9-5dfaefe88fdf",
39
+ "metadata": {},
40
+ "outputs": [
41
+ {
42
+ "data": {
43
+ "text/plain": [
44
+ "151851"
45
+ ]
46
+ },
47
+ "execution_count": 37,
48
+ "metadata": {},
49
+ "output_type": "execute_result"
50
+ }
51
+ ],
52
+ "source": [
53
+ "len(tokenizer)"
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": 38,
59
+ "id": "4b583802-0e69-40d2-a32e-745ea50ede63",
60
+ "metadata": {},
61
+ "outputs": [
62
+ {
63
+ "name": "stdout",
64
+ "output_type": "stream",
65
+ "text": [
66
+ "[14990, 5562]\n",
67
+ "text : hello dog\n"
68
+ ]
69
+ }
70
+ ],
71
+ "source": [
72
+ "#tokenizer('hello dog')['input_ids']\n",
73
+ "#tokenizer('hello dog').tokens()\n",
74
+ "tks = tokenizer.encode('hello dog')\n",
75
+ "print(tks)\n",
76
+ "text = tokenizer.decode(tks)\n",
77
+ "print(\"text :\", text)"
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "code",
82
+ "execution_count": 39,
83
+ "id": "9b360ba8-692d-4e3c-b9c9-fae3969c3617",
84
+ "metadata": {},
85
+ "outputs": [
86
+ {
87
+ "data": {
88
+ "text/plain": [
89
+ "' �'"
90
+ ]
91
+ },
92
+ "execution_count": 39,
93
+ "metadata": {},
94
+ "output_type": "execute_result"
95
+ }
96
+ ],
97
+ "source": [
98
+ "tokenizer.decode([51461])"
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "code",
103
+ "execution_count": 40,
104
+ "id": "85b41d25-acbd-4818-a8e9-5020a3009b1d",
105
+ "metadata": {},
106
+ "outputs": [
107
+ {
108
+ "data": {
109
+ "text/plain": [
110
+ "' 根'"
111
+ ]
112
+ },
113
+ "execution_count": 40,
114
+ "metadata": {},
115
+ "output_type": "execute_result"
116
+ }
117
+ ],
118
+ "source": [
119
+ "tokenizer.decode([51461, 117])"
120
+ ]
121
+ },
122
+ {
123
+ "cell_type": "code",
124
+ "execution_count": 41,
125
+ "id": "abb10266-3da5-4586-ab6e-8812dc83ce3d",
126
+ "metadata": {},
127
+ "outputs": [
128
+ {
129
+ "data": {
130
+ "text/plain": [
131
+ "{b'!': 0,\n",
132
+ " b'\"': 1,\n",
133
+ " b'#': 2,\n",
134
+ " b'$': 3,\n",
135
+ " b'%': 4,\n",
136
+ " b'&': 5,\n",
137
+ " b\"'\": 6,\n",
138
+ " b'(': 7,\n",
139
+ " b')': 8,\n",
140
+ " b'*': 9,\n",
141
+ " b'+': 10,\n",
142
+ " b',': 11,\n",
143
+ " b'-': 12,\n",
144
+ " b'.': 13,\n",
145
+ " b'/': 14,\n",
146
+ " b'0': 15,\n",
147
+ " b'1': 16,\n",
148
+ " b'2': 17,\n",
149
+ " b'3': 18,\n",
150
+ " b'4': 19,\n",
151
+ " b'5': 20,\n",
152
+ " b'6': 21,\n",
153
+ " b'7': 22,\n",
154
+ " b'8': 23,\n",
155
+ " b'9': 24,\n",
156
+ " b':': 25,\n",
157
+ " b';': 26,\n",
158
+ " b'<': 27,\n",
159
+ " b'=': 28,\n",
160
+ " b'>': 29,\n",
161
+ " b'?': 30,\n",
162
+ " b'@': 31,\n",
163
+ " b'A': 32,\n",
164
+ " b'B': 33,\n",
165
+ " b'C': 34,\n",
166
+ " b'D': 35,\n",
167
+ " b'E': 36,\n",
168
+ " b'F': 37,\n",
169
+ " b'G': 38,\n",
170
+ " b'H': 39,\n",
171
+ " b'I': 40,\n",
172
+ " b'J': 41,\n",
173
+ " b'K': 42,\n",
174
+ " b'L': 43,\n",
175
+ " b'M': 44,\n",
176
+ " b'N': 45,\n",
177
+ " b'O': 46,\n",
178
+ " b'P': 47,\n",
179
+ " b'Q': 48,\n",
180
+ " b'R': 49,\n",
181
+ " b'S': 50,\n",
182
+ " b'T': 51,\n",
183
+ " b'U': 52,\n",
184
+ " b'V': 53,\n",
185
+ " b'W': 54,\n",
186
+ " b'X': 55,\n",
187
+ " b'Y': 56,\n",
188
+ " b'Z': 57,\n",
189
+ " b'[': 58,\n",
190
+ " b'\\\\': 59,\n",
191
+ " b']': 60,\n",
192
+ " b'^': 61,\n",
193
+ " b'_': 62,\n",
194
+ " b'`': 63,\n",
195
+ " b'a': 64,\n",
196
+ " b'b': 65,\n",
197
+ " b'c': 66,\n",
198
+ " b'd': 67,\n",
199
+ " b'e': 68,\n",
200
+ " b'f': 69,\n",
201
+ " b'g': 70,\n",
202
+ " b'h': 71,\n",
203
+ " b'i': 72,\n",
204
+ " b'j': 73,\n",
205
+ " b'k': 74,\n",
206
+ " b'l': 75,\n",
207
+ " b'm': 76,\n",
208
+ " b'n': 77,\n",
209
+ " b'o': 78,\n",
210
+ " b'p': 79,\n",
211
+ " b'q': 80,\n",
212
+ " b'r': 81,\n",
213
+ " b's': 82,\n",
214
+ " b't': 83,\n",
215
+ " b'u': 84,\n",
216
+ " b'v': 85,\n",
217
+ " b'w': 86,\n",
218
+ " b'x': 87,\n",
219
+ " b'y': 88,\n",
220
+ " b'z': 89,\n",
221
+ " b'{': 90,\n",
222
+ " b'|': 91,\n",
223
+ " b'}': 92,\n",
224
+ " b'~': 93,\n",
225
+ " b'\\xa1': 94,\n",
226
+ " b'\\xa2': 95,\n",
227
+ " b'\\xa3': 96,\n",
228
+ " b'\\xa4': 97,\n",
229
+ " b'\\xa5': 98,\n",
230
+ " b'\\xa6': 99,\n",
231
+ " b'\\xa7': 100,\n",
232
+ " b'\\xa8': 101,\n",
233
+ " b'\\xa9': 102,\n",
234
+ " b'\\xaa': 103,\n",
235
+ " b'\\xab': 104,\n",
236
+ " b'\\xac': 105,\n",
237
+ " b'\\xae': 106,\n",
238
+ " b'\\xaf': 107,\n",
239
+ " b'\\xb0': 108,\n",
240
+ " b'\\xb1': 109,\n",
241
+ " b'\\xb2': 110,\n",
242
+ " b'\\xb3': 111,\n",
243
+ " b'\\xb4': 112,\n",
244
+ " b'\\xb5': 113,\n",
245
+ " b'\\xb6': 114,\n",
246
+ " b'\\xb7': 115,\n",
247
+ " b'\\xb8': 116,\n",
248
+ " b'\\xb9': 117,\n",
249
+ " b'\\xba': 118,\n",
250
+ " b'\\xbb': 119,\n",
251
+ " b'\\xbc': 120,\n",
252
+ " b'\\xbd': 121,\n",
253
+ " b'\\xbe': 122,\n",
254
+ " b'\\xbf': 123,\n",
255
+ " b'\\xc0': 124,\n",
256
+ " b'\\xc1': 125,\n",
257
+ " b'\\xc2': 126,\n",
258
+ " b'\\xc3': 127,\n",
259
+ " b'\\xc4': 128,\n",
260
+ " b'\\xc5': 129,\n",
261
+ " b'\\xc6': 130,\n",
262
+ " b'\\xc7': 131,\n",
263
+ " b'\\xc8': 132,\n",
264
+ " b'\\xc9': 133,\n",
265
+ " b'\\xca': 134,\n",
266
+ " b'\\xcb': 135,\n",
267
+ " b'\\xcc': 136,\n",
268
+ " b'\\xcd': 137,\n",
269
+ " b'\\xce': 138,\n",
270
+ " b'\\xcf': 139,\n",
271
+ " b'\\xd0': 140,\n",
272
+ " b'\\xd1': 141,\n",
273
+ " b'\\xd2': 142,\n",
274
+ " b'\\xd3': 143,\n",
275
+ " b'\\xd4': 144,\n",
276
+ " b'\\xd5': 145,\n",
277
+ " b'\\xd6': 146,\n",
278
+ " b'\\xd7': 147,\n",
279
+ " b'\\xd8': 148,\n",
280
+ " b'\\xd9': 149,\n",
281
+ " b'\\xda': 150,\n",
282
+ " b'\\xdb': 151,\n",
283
+ " b'\\xdc': 152,\n",
284
+ " b'\\xdd': 153,\n",
285
+ " b'\\xde': 154,\n",
286
+ " b'\\xdf': 155,\n",
287
+ " b'\\xe0': 156,\n",
288
+ " b'\\xe1': 157,\n",
289
+ " b'\\xe2': 158,\n",
290
+ " b'\\xe3': 159,\n",
291
+ " b'\\xe4': 160,\n",
292
+ " b'\\xe5': 161,\n",
293
+ " b'\\xe6': 162,\n",
294
+ " b'\\xe7': 163,\n",
295
+ " b'\\xe8': 164,\n",
296
+ " b'\\xe9': 165,\n",
297
+ " b'\\xea': 166,\n",
298
+ " b'\\xeb': 167,\n",
299
+ " b'\\xec': 168,\n",
300
+ " b'\\xed': 169,\n",
301
+ " b'\\xee': 170,\n",
302
+ " b'\\xef': 171,\n",
303
+ " b'\\xf0': 172,\n",
304
+ " b'\\xf1': 173,\n",
305
+ " b'\\xf2': 174,\n",
306
+ " b'\\xf3': 175,\n",
307
+ " b'\\xf4': 176,\n",
308
+ " b'\\xf5': 177,\n",
309
+ " b'\\xf6': 178,\n",
310
+ " b'\\xf7': 179,\n",
311
+ " b'\\xf8': 180,\n",
312
+ " b'\\xf9': 181,\n",
313
+ " b'\\xfa': 182,\n",
314
+ " b'\\xfb': 183,\n",
315
+ " b'\\xfc': 184,\n",
316
+ " b'\\xfd': 185,\n",
317
+ " b'\\xfe': 186,\n",
318
+ " b'\\xff': 187,\n",
319
+ " b'\\x00': 188,\n",
320
+ " b'\\x01': 189,\n",
321
+ " b'\\x02': 190,\n",
322
+ " b'\\x03': 191,\n",
323
+ " b'\\x04': 192,\n",
324
+ " b'\\x05': 193,\n",
325
+ " b'\\x06': 194,\n",
326
+ " b'\\x07': 195,\n",
327
+ " b'\\x08': 196,\n",
328
+ " b'\\t': 197,\n",
329
+ " b'\\n': 198,\n",
330
+ " b'\\x0b': 199,\n",
331
+ " b'\\x0c': 200,\n",
332
+ " b'\\r': 201,\n",
333
+ " b'\\x0e': 202,\n",
334
+ " b'\\x0f': 203,\n",
335
+ " b'\\x10': 204,\n",
336
+ " b'\\x11': 205,\n",
337
+ " b'\\x12': 206,\n",
338
+ " b'\\x13': 207,\n",
339
+ " b'\\x14': 208,\n",
340
+ " b'\\x15': 209,\n",
341
+ " b'\\x16': 210,\n",
342
+ " b'\\x17': 211,\n",
343
+ " b'\\x18': 212,\n",
344
+ " b'\\x19': 213,\n",
345
+ " b'\\x1a': 214,\n",
346
+ " b'\\x1b': 215,\n",
347
+ " b'\\x1c': 216,\n",
348
+ " b'\\x1d': 217,\n",
349
+ " b'\\x1e': 218,\n",
350
+ " b'\\x1f': 219,\n",
351
+ " b' ': 220,\n",
352
+ " b'\\x7f': 221,\n",
353
+ " b'\\x80': 222,\n",
354
+ " b'\\x81': 223,\n",
355
+ " b'\\x82': 224,\n",
356
+ " b'\\x83': 225,\n",
357
+ " b'\\x84': 226,\n",
358
+ " b'\\x85': 227,\n",
359
+ " b'\\x86': 228,\n",
360
+ " b'\\x87': 229,\n",
361
+ " b'\\x88': 230,\n",
362
+ " b'\\x89': 231,\n",
363
+ " b'\\x8a': 232,\n",
364
+ " b'\\x8b': 233,\n",
365
+ " b'\\x8c': 234,\n",
366
+ " b'\\x8d': 235,\n",
367
+ " b'\\x8e': 236,\n",
368
+ " b'\\x8f': 237,\n",
369
+ " b'\\x90': 238,\n",
370
+ " b'\\x91': 239,\n",
371
+ " b'\\x92': 240,\n",
372
+ " b'\\x93': 241,\n",
373
+ " b'\\x94': 242,\n",
374
+ " b'\\x95': 243,\n",
375
+ " b'\\x96': 244,\n",
376
+ " b'\\x97': 245,\n",
377
+ " b'\\x98': 246,\n",
378
+ " b'\\x99': 247,\n",
379
+ " b'\\x9a': 248,\n",
380
+ " b'\\x9b': 249,\n",
381
+ " b'\\x9c': 250,\n",
382
+ " b'\\x9d': 251,\n",
383
+ " b'\\x9e': 252,\n",
384
+ " b'\\x9f': 253,\n",
385
+ " b'\\xa0': 254,\n",
386
+ " b'\\xad': 255,\n",
387
+ " b' ': 256,\n",
388
+ " b' ': 257,\n",
389
+ " b'in': 258,\n",
390
+ " b' t': 259,\n",
391
+ " b' ': 260,\n",
392
+ " b'er': 261,\n",
393
+ " b' ': 262,\n",
394
+ " b'on': 263,\n",
395
+ " b' a': 264,\n",
396
+ " b're': 265,\n",
397
+ " b'at': 266,\n",
398
+ " b'st': 267,\n",
399
+ " b'en': 268,\n",
400
+ " b'or': 269,\n",
401
+ " b' th': 270,\n",
402
+ " b'\\n\\n': 271,\n",
403
+ " b' c': 272,\n",
404
+ " b'le': 273,\n",
405
+ " b' s': 274,\n",
406
+ " b'it': 275,\n",
407
+ " b'an': 276,\n",
408
+ " b'ar': 277,\n",
409
+ " b'al': 278,\n",
410
+ " b' the': 279,\n",
411
+ " b';\\n': 280,\n",
412
+ " b' p': 281,\n",
413
+ " b' f': 282,\n",
414
+ " b'ou': 283,\n",
415
+ " b' =': 284,\n",
416
+ " b'is': 285,\n",
417
+ " b' ': 286,\n",
418
+ " b'ing': 287,\n",
419
+ " b'es': 288,\n",
420
+ " b' w': 289,\n",
421
+ " b'ion': 290,\n",
422
+ " b'ed': 291,\n",
423
+ " b'ic': 292,\n",
424
+ " b' b': 293,\n",
425
+ " b' d': 294,\n",
426
+ " b'et': 295,\n",
427
+ " b' m': 296,\n",
428
+ " b' o': 297,\n",
429
+ " b'\\t\\t': 298,\n",
430
+ " b'ro': 299,\n",
431
+ " b'as': 300,\n",
432
+ " b'el': 301,\n",
433
+ " b'ct': 302,\n",
434
+ " b'nd': 303,\n",
435
+ " b' in': 304,\n",
436
+ " b' h': 305,\n",
437
+ " b'ent': 306,\n",
438
+ " b'id': 307,\n",
439
+ " b' n': 308,\n",
440
+ " b'am': 309,\n",
441
+ " b' ': 310,\n",
442
+ " b' to': 311,\n",
443
+ " b' re': 312,\n",
444
+ " b'--': 313,\n",
445
+ " b' {': 314,\n",
446
+ " b' of': 315,\n",
447
+ " b'om': 316,\n",
448
+ " b');\\n': 317,\n",
449
+ " b'im': 318,\n",
450
+ " b'\\r\\n': 319,\n",
451
+ " b' (': 320,\n",
452
+ " b'il': 321,\n",
453
+ " b'//': 322,\n",
454
+ " b' and': 323,\n",
455
+ " b'ur': 324,\n",
456
+ " b'se': 325,\n",
457
+ " b' l': 326,\n",
458
+ " b'ex': 327,\n",
459
+ " b' S': 328,\n",
460
+ " b'ad': 329,\n",
461
+ " b' \"': 330,\n",
462
+ " b'ch': 331,\n",
463
+ " b'ut': 332,\n",
464
+ " b'if': 333,\n",
465
+ " b'**': 334,\n",
466
+ " b' }': 335,\n",
467
+ " b'em': 336,\n",
468
+ " b'ol': 337,\n",
469
+ " b' ': 338,\n",
470
+ " b'th': 339,\n",
471
+ " b')\\n': 340,\n",
472
+ " b' {\\n': 341,\n",
473
+ " b' g': 342,\n",
474
+ " b'ig': 343,\n",
475
+ " b'iv': 344,\n",
476
+ " b',\\n': 345,\n",
477
+ " b'ce': 346,\n",
478
+ " b'od': 347,\n",
479
+ " b' v': 348,\n",
480
+ " b'ate': 349,\n",
481
+ " b' T': 350,\n",
482
+ " b'ag': 351,\n",
483
+ " b'ay': 352,\n",
484
+ " b' *': 353,\n",
485
+ " b'ot': 354,\n",
486
+ " b'us': 355,\n",
487
+ " b' C': 356,\n",
488
+ " b' st': 357,\n",
489
+ " b' I': 358,\n",
490
+ " b'un': 359,\n",
491
+ " b'ul': 360,\n",
492
+ " b'ue': 361,\n",
493
+ " b' A': 362,\n",
494
+ " b'ow': 363,\n",
495
+ " b\" '\": 364,\n",
496
+ " b'ew': 365,\n",
497
+ " b' <': 366,\n",
498
+ " b'ation': 367,\n",
499
+ " b'()': 368,\n",
500
+ " b' for': 369,\n",
501
+ " b'ab': 370,\n",
502
+ " b'ort': 371,\n",
503
+ " b'um': 372,\n",
504
+ " b'ame': 373,\n",
505
+ " b' is': 374,\n",
506
+ " b'pe': 375,\n",
507
+ " b'tr': 376,\n",
508
+ " b'ck': 377,\n",
509
+ " b'\\xe2\\x80': 378,\n",
510
+ " b' y': 379,\n",
511
+ " b'ist': 380,\n",
512
+ " b'----': 381,\n",
513
+ " b'.\\n\\n': 382,\n",
514
+ " b'he': 383,\n",
515
+ " b' e': 384,\n",
516
+ " b'lo': 385,\n",
517
+ " b' M': 386,\n",
518
+ " b' be': 387,\n",
519
+ " b'ers': 388,\n",
520
+ " b' on': 389,\n",
521
+ " b' con': 390,\n",
522
+ " b'ap': 391,\n",
523
+ " b'ub': 392,\n",
524
+ " b' P': 393,\n",
525
+ " b' ': 394,\n",
526
+ " b'ass': 395,\n",
527
+ " b'int': 396,\n",
528
+ " b'>\\n': 397,\n",
529
+ " b'ly': 398,\n",
530
+ " b'urn': 399,\n",
531
+ " b' $': 400,\n",
532
+ " b';\\n\\n': 401,\n",
533
+ " b'av': 402,\n",
534
+ " b'port': 403,\n",
535
+ " b'ir': 404,\n",
536
+ " b'->': 405,\n",
537
+ " b'nt': 406,\n",
538
+ " b'ction': 407,\n",
539
+ " b'end': 408,\n",
540
+ " b' de': 409,\n",
541
+ " b'ith': 410,\n",
542
+ " b'out': 411,\n",
543
+ " b'turn': 412,\n",
544
+ " b'our': 413,\n",
545
+ " b' ': 414,\n",
546
+ " b'lic': 415,\n",
547
+ " b'res': 416,\n",
548
+ " b'pt': 417,\n",
549
+ " b'==': 418,\n",
550
+ " b' this': 419,\n",
551
+ " b' wh': 420,\n",
552
+ " b' if': 421,\n",
553
+ " b' D': 422,\n",
554
+ " b'ver': 423,\n",
555
+ " b'age': 424,\n",
556
+ " b' B': 425,\n",
557
+ " b'ht': 426,\n",
558
+ " b'ext': 427,\n",
559
+ " b'=\"': 428,\n",
560
+ " b' that': 429,\n",
561
+ " b'****': 430,\n",
562
+ " b' R': 431,\n",
563
+ " b' it': 432,\n",
564
+ " b'ess': 433,\n",
565
+ " b' F': 434,\n",
566
+ " b' r': 435,\n",
567
+ " b'os': 436,\n",
568
+ " b'and': 437,\n",
569
+ " b' as': 438,\n",
570
+ " b'ect': 439,\n",
571
+ " b'ke': 440,\n",
572
+ " b'rom': 441,\n",
573
+ " b' //': 442,\n",
574
+ " b'con': 443,\n",
575
+ " b' L': 444,\n",
576
+ " b'(\"': 445,\n",
577
+ " b'qu': 446,\n",
578
+ " b'lass': 447,\n",
579
+ " b' with': 448,\n",
580
+ " b'iz': 449,\n",
581
+ " b'de': 450,\n",
582
+ " b' N': 451,\n",
583
+ " b' al': 452,\n",
584
+ " b'op': 453,\n",
585
+ " b'up': 454,\n",
586
+ " b'get': 455,\n",
587
+ " b' }\\n': 456,\n",
588
+ " b'ile': 457,\n",
589
+ " b' an': 458,\n",
590
+ " b'ata': 459,\n",
591
+ " b'ore': 460,\n",
592
+ " b'ri': 461,\n",
593
+ " b' pro': 462,\n",
594
+ " b';\\r\\n': 463,\n",
595
+ " b'\\t\\t\\t\\t': 464,\n",
596
+ " b'ter': 465,\n",
597
+ " b'ain': 466,\n",
598
+ " b' W': 467,\n",
599
+ " b' E': 468,\n",
600
+ " b' com': 469,\n",
601
+ " b' return': 470,\n",
602
+ " b'art': 471,\n",
603
+ " b' H': 472,\n",
604
+ " b'ack': 473,\n",
605
+ " b'import': 474,\n",
606
+ " b'ublic': 475,\n",
607
+ " b' or': 476,\n",
608
+ " b'est': 477,\n",
609
+ " b'ment': 478,\n",
610
+ " b' G': 479,\n",
611
+ " b'able': 480,\n",
612
+ " b' -': 481,\n",
613
+ " b'ine': 482,\n",
614
+ " b'ill': 483,\n",
615
+ " b'ind': 484,\n",
616
+ " b'ere': 485,\n",
617
+ " b'::': 486,\n",
618
+ " b'ity': 487,\n",
619
+ " b' +': 488,\n",
620
+ " b' tr': 489,\n",
621
+ " b'elf': 490,\n",
622
+ " b'ight': 491,\n",
623
+ " b\"('\": 492,\n",
624
+ " b'orm': 493,\n",
625
+ " b'ult': 494,\n",
626
+ " b'str': 495,\n",
627
+ " b'..': 496,\n",
628
+ " b'\",': 497,\n",
629
+ " b' you': 498,\n",
630
+ " b'ype': 499,\n",
631
+ " b'pl': 500,\n",
632
+ " b' new': 501,\n",
633
+ " b' j': 502,\n",
634
+ " b' ': 503,\n",
635
+ " b' from': 504,\n",
636
+ " b' ex': 505,\n",
637
+ " b' O': 506,\n",
638
+ " b'ld': 507,\n",
639
+ " b' [': 508,\n",
640
+ " b'oc': 509,\n",
641
+ " b':\\n': 510,\n",
642
+ " b' se': 511,\n",
643
+ " b' le': 512,\n",
644
+ " b'--------': 513,\n",
645
+ " b'.s': 514,\n",
646
+ " b'{\\n': 515,\n",
647
+ " b\"',\": 516,\n",
648
+ " b'ant': 517,\n",
649
+ " b' at': 518,\n",
650
+ " b'ase': 519,\n",
651
+ " b'.c': 520,\n",
652
+ " b' ch': 521,\n",
653
+ " b'</': 522,\n",
654
+ " b'ave': 523,\n",
655
+ " b'ang': 524,\n",
656
+ " b' are': 525,\n",
657
+ " b' int': 526,\n",
658
+ " b'\\xe2\\x80\\x99': 527,\n",
659
+ " b'_t': 528,\n",
660
+ " b'ert': 529,\n",
661
+ " b'ial': 530,\n",
662
+ " b'act': 531,\n",
663
+ " b'}\\n': 532,\n",
664
+ " b'ive': 533,\n",
665
+ " b'ode': 534,\n",
666
+ " b'ost': 535,\n",
667
+ " b' class': 536,\n",
668
+ " b' not': 537,\n",
669
+ " b'og': 538,\n",
670
+ " b'ord': 539,\n",
671
+ " b'alue': 540,\n",
672
+ " b'all': 541,\n",
673
+ " b'ff': 542,\n",
674
+ " b'();\\n': 543,\n",
675
+ " b'ont': 544,\n",
676
+ " b'ime': 545,\n",
677
+ " b'are': 546,\n",
678
+ " b' U': 547,\n",
679
+ " b' pr': 548,\n",
680
+ " b' :': 549,\n",
681
+ " b'ies': 550,\n",
682
+ " b'ize': 551,\n",
683
+ " b'ure': 552,\n",
684
+ " b' by': 553,\n",
685
+ " b'ire': 554,\n",
686
+ " b' }\\n\\n': 555,\n",
687
+ " b'.p': 556,\n",
688
+ " b' sh': 557,\n",
689
+ " b'ice': 558,\n",
690
+ " b'ast': 559,\n",
691
+ " b'ption': 560,\n",
692
+ " b'tring': 561,\n",
693
+ " b'ok': 562,\n",
694
+ " b'__': 563,\n",
695
+ " b'cl': 564,\n",
696
+ " b'##': 565,\n",
697
+ " b' he': 566,\n",
698
+ " b'ard': 567,\n",
699
+ " b').': 568,\n",
700
+ " b' @': 569,\n",
701
+ " b'iew': 570,\n",
702
+ " b'\\t\\t\\t': 571,\n",
703
+ " b' was': 572,\n",
704
+ " b'ip': 573,\n",
705
+ " b'this': 574,\n",
706
+ " b' u': 575,\n",
707
+ " b' The': 576,\n",
708
+ " b'ide': 577,\n",
709
+ " b'ace': 578,\n",
710
+ " b'ib': 579,\n",
711
+ " b'ac': 580,\n",
712
+ " b'rou': 581,\n",
713
+ " b' we': 582,\n",
714
+ " b'ject': 583,\n",
715
+ " b' public': 584,\n",
716
+ " b'ak': 585,\n",
717
+ " b've': 586,\n",
718
+ " b'ath': 587,\n",
719
+ " b'oid': 588,\n",
720
+ " b' =>': 589,\n",
721
+ " b'ust': 590,\n",
722
+ " b'que': 591,\n",
723
+ " b' res': 592,\n",
724
+ " b'))': 593,\n",
725
+ " b\"'s\": 594,\n",
726
+ " b' k': 595,\n",
727
+ " b'ans': 596,\n",
728
+ " b'yst': 597,\n",
729
+ " b'unction': 598,\n",
730
+ " b'********': 599,\n",
731
+ " b' i': 600,\n",
732
+ " b' us': 601,\n",
733
+ " b'pp': 602,\n",
734
+ " b'one': 603,\n",
735
+ " b'ail': 604,\n",
736
+ " b'====': 605,\n",
737
+ " b'name': 606,\n",
738
+ " b' str': 607,\n",
739
+ " b' /': 608,\n",
740
+ " b' &': 609,\n",
741
+ " b'ach': 610,\n",
742
+ " b'div': 611,\n",
743
+ " b'ystem': 612,\n",
744
+ " b'ell': 613,\n",
745
+ " b' have': 614,\n",
746
+ " b'err': 615,\n",
747
+ " b'ould': 616,\n",
748
+ " b'ull': 617,\n",
749
+ " b'pon': 618,\n",
750
+ " b' J': 619,\n",
751
+ " b'_p': 620,\n",
752
+ " b' ==': 621,\n",
753
+ " b'ign': 622,\n",
754
+ " b'St': 623,\n",
755
+ " b'.\\n': 624,\n",
756
+ " b' pl': 625,\n",
757
+ " b');\\n\\n': 626,\n",
758
+ " b'form': 627,\n",
759
+ " b'put': 628,\n",
760
+ " b'ount': 629,\n",
761
+ " b'}\\n\\n': 630,\n",
762
+ " b'dd': 631,\n",
763
+ " b'ite': 632,\n",
764
+ " b' get': 633,\n",
765
+ " b'rr': 634,\n",
766
+ " b'ome': 635,\n",
767
+ " b' \\xe2\\x80': 636,\n",
768
+ " b'aram': 637,\n",
769
+ " b'cc': 638,\n",
770
+ " b' */': 639,\n",
771
+ " b'ER': 640,\n",
772
+ " b'In': 641,\n",
773
+ " b'les': 642,\n",
774
+ " b'_s': 643,\n",
775
+ " b'ong': 644,\n",
776
+ " b'ie': 645,\n",
777
+ " b' can': 646,\n",
778
+ " b' V': 647,\n",
779
+ " b'erv': 648,\n",
780
+ " b'pr': 649,\n",
781
+ " b' un': 650,\n",
782
+ " b'row': 651,\n",
783
+ " b'ber': 652,\n",
784
+ " b' do': 653,\n",
785
+ " b'll': 654,\n",
786
+ " b' el': 655,\n",
787
+ " b' self': 656,\n",
788
+ " b'ated': 657,\n",
789
+ " b'ary': 658,\n",
790
+ " b' .': 659,\n",
791
+ " b\"']\": 660,\n",
792
+ " b'ud': 661,\n",
793
+ " b' en': 662,\n",
794
+ " b' Th': 663,\n",
795
+ " b' ': 664,\n",
796
+ " b'te': 665,\n",
797
+ " b'_c': 666,\n",
798
+ " b'uct': 667,\n",
799
+ " b' ab': 668,\n",
800
+ " b'ork': 669,\n",
801
+ " b'.get': 670,\n",
802
+ " b' #': 671,\n",
803
+ " b'aw': 672,\n",
804
+ " b'ress': 673,\n",
805
+ " b'ob': 674,\n",
806
+ " b'Name': 675,\n",
807
+ " b'app': 676,\n",
808
+ " b\"['\": 677,\n",
809
+ " b' all': 678,\n",
810
+ " b'ory': 679,\n",
811
+ " b'ition': 680,\n",
812
+ " b'ance': 681,\n",
813
+ " b'ear': 682,\n",
814
+ " b' cont': 683,\n",
815
+ " b'vent': 684,\n",
816
+ " b'ia': 685,\n",
817
+ " b' will': 686,\n",
818
+ " b'IN': 687,\n",
819
+ " b' ': 688,\n",
820
+ " b'return': 689,\n",
821
+ " b' </': 690,\n",
822
+ " b'data': 691,\n",
823
+ " b')\\n\\n': 692,\n",
824
+ " b'Re': 693,\n",
825
+ " b'ple': 694,\n",
826
+ " b'ild': 695,\n",
827
+ " b'ther': 696,\n",
828
+ " b' your': 697,\n",
829
+ " b'\"\\n': 698,\n",
830
+ " b'($': 699,\n",
831
+ " b' out': 700,\n",
832
+ " b'),': 701,\n",
833
+ " b' has': 702,\n",
834
+ " b'String': 703,\n",
835
+ " b'so': 704,\n",
836
+ " b' up': 705,\n",
837
+ " b'ax': 706,\n",
838
+ " b' def': 707,\n",
839
+ " b' bo': 708,\n",
840
+ " b'ge': 709,\n",
841
+ " b'alse': 710,\n",
842
+ " b'ON': 711,\n",
843
+ " b'per': 712,\n",
844
+ " b'ich': 713,\n",
845
+ " b' but': 714,\n",
846
+ " b' \\n': 715,\n",
847
+ " b' _': 716,\n",
848
+ " b'_m': 717,\n",
849
+ " b'add': 718,\n",
850
+ " b'quest': 719,\n",
851
+ " b'odel': 720,\n",
852
+ " b'self': 721,\n",
853
+ " b'ery': 722,\n",
854
+ " b'ft': 723,\n",
855
+ " b'ens': 724,\n",
856
+ " b'////': 725,\n",
857
+ " b'ake': 726,\n",
858
+ " b'.C': 727,\n",
859
+ " b' go': 728,\n",
860
+ " b' function': 729,\n",
861
+ " b' K': 730,\n",
862
+ " b'ivate': 731,\n",
863
+ " b' im': 732,\n",
864
+ " b' const': 733,\n",
865
+ " b'.t': 734,\n",
866
+ " b' */\\n': 735,\n",
867
+ " b');\\r\\n': 736,\n",
868
+ " b' void': 737,\n",
869
+ " b' set': 738,\n",
870
+ " b' System': 739,\n",
871
+ " b'cri': 740,\n",
872
+ " b'()\\n': 741,\n",
873
+ " b'li': 742,\n",
874
+ " b'\\tif': 743,\n",
875
+ " b'.m': 744,\n",
876
+ " b'ally': 745,\n",
877
+ " b'set': 746,\n",
878
+ " b'ep': 747,\n",
879
+ " b'\\xe2\\x80\\x99s': 748,\n",
880
+ " b'bo': 749,\n",
881
+ " b'def': 750,\n",
882
+ " b\"',\\n\": 751,\n",
883
+ " b' me': 752,\n",
884
+ " b' !': 753,\n",
885
+ " b'atch': 754,\n",
886
+ " b'\">': 755,\n",
887
+ " b'\",\\n': 756,\n",
888
+ " b'ec': 757,\n",
889
+ " b' In': 758,\n",
890
+ " b'ph': 759,\n",
891
+ " b' |': 760,\n",
892
+ " b'_f': 761,\n",
893
+ " b' var': 762,\n",
894
+ " b'ence': 763,\n",
895
+ " b'Id': 764,\n",
896
+ " b'ree': 765,\n",
897
+ " b'ink': 766,\n",
898
+ " b'lect': 767,\n",
899
+ " b'ug': 768,\n",
900
+ " b'eth': 769,\n",
901
+ " b' else': 770,\n",
902
+ " b'----------------': 771,\n",
903
+ " b'cont': 772,\n",
904
+ " b' so': 773,\n",
905
+ " b'atic': 774,\n",
906
+ " b' lo': 775,\n",
907
+ " b'pro': 776,\n",
908
+ " b'ton': 777,\n",
909
+ " b'ss': 778,\n",
910
+ " b'own': 779,\n",
911
+ " b'abel': 780,\n",
912
+ " b'oint': 781,\n",
913
+ " b'ous': 782,\n",
914
+ " b'eld': 783,\n",
915
+ " b'ST': 784,\n",
916
+ " b'The': 785,\n",
917
+ " b' ': 786,\n",
918
+ " b'RE': 787,\n",
919
+ " b'\":': 788,\n",
920
+ " b'olor': 789,\n",
921
+ " b'tp': 790,\n",
922
+ " b'eg': 791,\n",
923
+ " b'key': 792,\n",
924
+ " b'ude': 793,\n",
925
+ " b' St': 794,\n",
926
+ " b'ound': 795,\n",
927
+ " b' ar': 796,\n",
928
+ " b'\");\\n': 797,\n",
929
+ " b'ener': 798,\n",
930
+ " b'ser': 799,\n",
931
+ " b'bject': 800,\n",
932
+ " b'essage': 801,\n",
933
+ " b'fer': 802,\n",
934
+ " b' more': 803,\n",
935
+ " b'ations': 804,\n",
936
+ " b'ents': 805,\n",
937
+ " b' his': 806,\n",
938
+ " b' they': 807,\n",
939
+ " b'.S': 808,\n",
940
+ " b' Y': 809,\n",
941
+ " b'use': 810,\n",
942
+ " b'ne': 811,\n",
943
+ " b'ish': 812,\n",
944
+ " b'old': 813,\n",
945
+ " b'_d': 814,\n",
946
+ " b'io': 815,\n",
947
+ " b'ield': 816,\n",
948
+ " b' per': 817,\n",
949
+ " b'Cont': 818,\n",
950
+ " b'ings': 819,\n",
951
+ " b'####': 820,\n",
952
+ " b' data': 821,\n",
953
+ " b' sa': 822,\n",
954
+ " b'ef': 823,\n",
955
+ " b'fo': 824,\n",
956
+ " b' one': 825,\n",
957
+ " b'eng': 826,\n",
958
+ " b' dis': 827,\n",
959
+ " b'AT': 828,\n",
960
+ " b' name': 829,\n",
961
+ " b' true': 830,\n",
962
+ " b'val': 831,\n",
963
+ " b'led': 832,\n",
964
+ " b'.f': 833,\n",
965
+ " b' ne': 834,\n",
966
+ " b' end': 835,\n",
967
+ " b'.T': 836,\n",
968
+ " b'cre': 837,\n",
969
+ " b'ark': 838,\n",
970
+ " b'log': 839,\n",
971
+ " b'Ex': 840,\n",
972
+ " b'error': 841,\n",
973
+ " b'_id': 842,\n",
974
+ " b'urre': 843,\n",
975
+ " b'ange': 844,\n",
976
+ " b' null': 845,\n",
977
+ " b'rray': 846,\n",
978
+ " b' my': 847,\n",
979
+ " b'pan': 848,\n",
980
+ " b'ict': 849,\n",
981
+ " b'ator': 850,\n",
982
+ " b'View': 851,\n",
983
+ " b'List': 852,\n",
984
+ " b'\\treturn': 853,\n",
985
+ " b'\\xe2\\x80\\x9d': 854,\n",
986
+ " b' pre': 855,\n",
987
+ " b' x': 856,\n",
988
+ " b'clude': 857,\n",
989
+ " b'arg': 858,\n",
990
+ " b'ov': 859,\n",
991
+ " b'.h': 860,\n",
992
+ " b' >': 861,\n",
993
+ " b' their': 862,\n",
994
+ " b\"')\": 863,\n",
995
+ " b'irst': 864,\n",
996
+ " b'ick': 865,\n",
997
+ " b'gh': 866,\n",
998
+ " b'LE': 867,\n",
999
+ " b'OR': 868,\n",
1000
+ " b' private': 869,\n",
1001
+ " b'tem': 870,\n",
1002
+ " b'\\r\\n\\r\\n': 871,\n",
1003
+ " b'user': 872,\n",
1004
+ " b' )': 873,\n",
1005
+ " b'com': 874,\n",
1006
+ " b'.A': 875,\n",
1007
+ " b'\";\\n': 876,\n",
1008
+ " b' id': 877,\n",
1009
+ " b'read': 878,\n",
1010
+ " b' who': 879,\n",
1011
+ " b'_b': 880,\n",
1012
+ " b'\">\\n': 881,\n",
1013
+ " b' time': 882,\n",
1014
+ " b' man': 883,\n",
1015
+ " b'ry': 884,\n",
1016
+ " b'========': 885,\n",
1017
+ " b'roup': 886,\n",
1018
+ " b'rop': 887,\n",
1019
+ " b'public': 888,\n",
1020
+ " b'vel': 889,\n",
1021
+ " b'umber': 890,\n",
1022
+ " b'ble': 891,\n",
1023
+ " b' which': 892,\n",
1024
+ " b'****************': 893,\n",
1025
+ " b' any': 894,\n",
1026
+ " b' false': 895,\n",
1027
+ " b'we': 896,\n",
1028
+ " b' value': 897,\n",
1029
+ " b' li': 898,\n",
1030
+ " b'\")': 899,\n",
1031
+ " b'nder': 900,\n",
1032
+ " b'gr': 901,\n",
1033
+ " b' no': 902,\n",
1034
+ " b'param': 903,\n",
1035
+ " b'fig': 904,\n",
1036
+ " b'.com': 905,\n",
1037
+ " b' app': 906,\n",
1038
+ " b'_l': 907,\n",
1039
+ " b'ions': 908,\n",
1040
+ " b'.D': 909,\n",
1041
+ " b' Ch': 910,\n",
1042
+ " b' about': 911,\n",
1043
+ " b' add': 912,\n",
1044
+ " b' su': 913,\n",
1045
+ " b' string': 914,\n",
1046
+ " b'ID': 915,\n",
1047
+ " b' over': 916,\n",
1048
+ " b'string': 917,\n",
1049
+ " b'.l': 918,\n",
1050
+ " b'ource': 919,\n",
1051
+ " b'_C': 920,\n",
1052
+ " b']\\n': 921,\n",
1053
+ " b' qu': 922,\n",
1054
+ " b' String': 923,\n",
1055
+ " b'ca': 924,\n",
1056
+ " b'SE': 925,\n",
1057
+ " b' ro': 926,\n",
1058
+ " b'sh': 927,\n",
1059
+ " b'ual': 928,\n",
1060
+ " b'Type': 929,\n",
1061
+ " b'son': 930,\n",
1062
+ " b'new': 931,\n",
1063
+ " b'ern': 932,\n",
1064
+ " b' ag': 933,\n",
1065
+ " b'AR': 934,\n",
1066
+ " b'];\\n': 935,\n",
1067
+ " b'].': 936,\n",
1068
+ " b' ?': 937,\n",
1069
+ " b'ical': 938,\n",
1070
+ " b' des': 939,\n",
1071
+ " b'uth': 940,\n",
1072
+ " b'ix': 941,\n",
1073
+ " b'ays': 942,\n",
1074
+ " b' type': 943,\n",
1075
+ " b\"'t\": 944,\n",
1076
+ " b'ault': 945,\n",
1077
+ " b' inter': 946,\n",
1078
+ " b'var': 947,\n",
1079
+ " b'.b': 948,\n",
1080
+ " b' part': 949,\n",
1081
+ " b'.d': 950,\n",
1082
+ " b'urrent': 951,\n",
1083
+ " b'IT': 952,\n",
1084
+ " b'EN': 953,\n",
1085
+ " b'enc': 954,\n",
1086
+ " b'(f': 955,\n",
1087
+ " b'ra': 956,\n",
1088
+ " b'value': 957,\n",
1089
+ " b'cho': 958,\n",
1090
+ " b'utton': 959,\n",
1091
+ " b'ose': 960,\n",
1092
+ " b' !=': 961,\n",
1093
+ " b'ater': 962,\n",
1094
+ " b'\\xc3\\xa9': 963,\n",
1095
+ " b'reate': 964,\n",
1096
+ " b'oll': 965,\n",
1097
+ " b'pos': 966,\n",
1098
+ " b'yle': 967,\n",
1099
+ " b'ng': 968,\n",
1100
+ " b'AL': 969,\n",
1101
+ " b'using': 970,\n",
1102
+ " b'ames': 971,\n",
1103
+ " b' {\\r\\n': 972,\n",
1104
+ " b'ates': 973,\n",
1105
+ " b'ely': 974,\n",
1106
+ " b' work': 975,\n",
1107
+ " b' em': 976,\n",
1108
+ " b'inal': 977,\n",
1109
+ " b' sp': 978,\n",
1110
+ " b' when': 979,\n",
1111
+ " b'.set': 980,\n",
1112
+ " b' ': 981,\n",
1113
+ " b'):\\n': 982,\n",
1114
+ " b'to': 983,\n",
1115
+ " b'quire': 984,\n",
1116
+ " b'indow': 985,\n",
1117
+ " b'lement': 986,\n",
1118
+ " b'pect': 987,\n",
1119
+ " b'ash': 988,\n",
1120
+ " b'[i': 989,\n",
1121
+ " b' use': 990,\n",
1122
+ " b'.F': 991,\n",
1123
+ " b'pec': 992,\n",
1124
+ " b' ad': 993,\n",
1125
+ " b'ove': 994,\n",
1126
+ " b'ception': 995,\n",
1127
+ " b'ength': 996,\n",
1128
+ " b'include': 997,\n",
1129
+ " b'ader': 998,\n",
1130
+ " b' ': 999,\n",
1131
+ " ...}"
1132
+ ]
1133
+ },
1134
+ "execution_count": 41,
1135
+ "metadata": {},
1136
+ "output_type": "execute_result"
1137
+ }
1138
+ ],
1139
+ "source": [
1140
+ "tokenizer.get_vocab()"
1141
+ ]
1142
+ },
1143
+ {
1144
+ "cell_type": "code",
1145
+ "execution_count": 42,
1146
+ "id": "8ea45811-b04f-461e-980e-7f7c6aa8e93c",
1147
+ "metadata": {},
1148
+ "outputs": [
1149
+ {
1150
+ "data": {
1151
+ "text/plain": [
1152
+ "[1350, 492, 151643, 863, 151643]"
1153
+ ]
1154
+ },
1155
+ "execution_count": 42,
1156
+ "metadata": {},
1157
+ "output_type": "execute_result"
1158
+ }
1159
+ ],
1160
+ "source": [
1161
+ "tokenizer.encode(\"print('<|endoftext|>')<|endoftext|>\")"
1162
+ ]
1163
+ },
1164
+ {
1165
+ "cell_type": "code",
1166
+ "execution_count": 43,
1167
+ "id": "58f6bd7f-2162-4d76-8994-eea7c0a9b367",
1168
+ "metadata": {},
1169
+ "outputs": [
1170
+ {
1171
+ "data": {
1172
+ "text/plain": [
1173
+ "'print'"
1174
+ ]
1175
+ },
1176
+ "execution_count": 43,
1177
+ "metadata": {},
1178
+ "output_type": "execute_result"
1179
+ }
1180
+ ],
1181
+ "source": [
1182
+ "tokenizer.decode([1350])"
1183
+ ]
1184
+ },
1185
+ {
1186
+ "cell_type": "code",
1187
+ "execution_count": 44,
1188
+ "id": "b655549f-f089-49c7-abff-6335a48ca117",
1189
+ "metadata": {},
1190
+ "outputs": [
1191
+ {
1192
+ "data": {
1193
+ "text/plain": [
1194
+ "\"print('<|endoftext|>')<|endoftext|>\""
1195
+ ]
1196
+ },
1197
+ "execution_count": 44,
1198
+ "metadata": {},
1199
+ "output_type": "execute_result"
1200
+ }
1201
+ ],
1202
+ "source": [
1203
+ "tokenizer.decode([1350, 492, 151643, 863, 151643])"
1204
+ ]
1205
+ },
1206
+ {
1207
+ "cell_type": "code",
1208
+ "execution_count": 45,
1209
+ "id": "11e44fcd-f707-4cb6-aa50-0d8d6d69c7e9",
1210
+ "metadata": {},
1211
+ "outputs": [
1212
+ {
1213
+ "data": {
1214
+ "text/plain": [
1215
+ "151643"
1216
+ ]
1217
+ },
1218
+ "execution_count": 45,
1219
+ "metadata": {},
1220
+ "output_type": "execute_result"
1221
+ }
1222
+ ],
1223
+ "source": [
1224
+ "tokenizer.eod_id"
1225
+ ]
1226
+ },
1227
+ {
1228
+ "cell_type": "code",
1229
+ "execution_count": 46,
1230
+ "id": "69277c19-171b-4d69-902c-738b3176aed6",
1231
+ "metadata": {},
1232
+ "outputs": [
1233
+ {
1234
+ "data": {
1235
+ "text/plain": [
1236
+ "[1350, 11146, 91, 8691, 723, 427, 91, 79865, 151643]"
1237
+ ]
1238
+ },
1239
+ "execution_count": 46,
1240
+ "metadata": {},
1241
+ "output_type": "execute_result"
1242
+ }
1243
+ ],
1244
+ "source": [
1245
+ "tokenizer.encode(\"print('<|endoftext|>')\", allowed_special=set(), disallowed_special=()) + [tokenizer.eod_id]"
1246
+ ]
1247
+ },
1248
+ {
1249
+ "cell_type": "code",
1250
+ "execution_count": 47,
1251
+ "id": "692bd4b6-8098-4596-b2bb-81bfb04f1fa0",
1252
+ "metadata": {},
1253
+ "outputs": [
1254
+ {
1255
+ "data": {
1256
+ "text/plain": [
1257
+ "'<|endoftext|>'"
1258
+ ]
1259
+ },
1260
+ "execution_count": 47,
1261
+ "metadata": {},
1262
+ "output_type": "execute_result"
1263
+ }
1264
+ ],
1265
+ "source": [
1266
+ "tokenizer.decode([151643])"
1267
+ ]
1268
+ },
1269
+ {
1270
+ "cell_type": "code",
1271
+ "execution_count": 48,
1272
+ "id": "49e51cb0-8242-489b-ace2-fc955e75844f",
1273
+ "metadata": {},
1274
+ "outputs": [
1275
+ {
1276
+ "data": {
1277
+ "text/plain": [
1278
+ "'|endoftext|'"
1279
+ ]
1280
+ },
1281
+ "execution_count": 48,
1282
+ "metadata": {},
1283
+ "output_type": "execute_result"
1284
+ }
1285
+ ],
1286
+ "source": [
1287
+ "tokenizer.decode([91, 8691, 723, 427, 91])"
1288
+ ]
1289
+ },
1290
+ {
1291
+ "cell_type": "code",
1292
+ "execution_count": 49,
1293
+ "id": "fd732a3d-046c-42a5-9edc-636f324a44aa",
1294
+ "metadata": {},
1295
+ "outputs": [
1296
+ {
1297
+ "data": {
1298
+ "text/plain": [
1299
+ "\"('<\""
1300
+ ]
1301
+ },
1302
+ "execution_count": 49,
1303
+ "metadata": {},
1304
+ "output_type": "execute_result"
1305
+ }
1306
+ ],
1307
+ "source": [
1308
+ "tokenizer.decode([11146])"
1309
+ ]
1310
+ },
1311
+ {
1312
+ "cell_type": "markdown",
1313
+ "id": "f3cab603-279b-48e6-b8ac-7ee82da543a5",
1314
+ "metadata": {},
1315
+ "source": [
1316
+ "**disallow special tokens from appearing in the input text**"
1317
+ ]
1318
+ },
1319
+ {
1320
+ "cell_type": "code",
1321
+ "execution_count": 51,
1322
+ "id": "72c2b0c0-38b0-41b0-9c4f-5246de72b07f",
1323
+ "metadata": {},
1324
+ "outputs": [
1325
+ {
1326
+ "ename": "ValueError",
1327
+ "evalue": "Encountered text corresponding to disallowed special token '<|endoftext|>'.\nIf you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.\nIf you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.\nTo disable this check for all special tokens, pass `disallowed_special=()`.\n",
1328
+ "output_type": "error",
1329
+ "traceback": [
1330
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1331
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
1332
+ "Cell \u001b[0;32mIn[51], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mprint(\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m<|endoftext|>\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m)\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mallowed_special\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mset\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdisallowed_special\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mall\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;241m+\u001b[39m [tokenizer\u001b[38;5;241m.\u001b[39meod_id]\n",
1333
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2373\u001b[0m, in \u001b[0;36mPreTrainedTokenizerBase.encode\u001b[0;34m(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)\u001b[0m\n\u001b[1;32m 2336\u001b[0m \u001b[38;5;129m@add_end_docstrings\u001b[39m(\n\u001b[1;32m 2337\u001b[0m ENCODE_KWARGS_DOCSTRING,\n\u001b[1;32m 2338\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2356\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[1;32m 2357\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m List[\u001b[38;5;28mint\u001b[39m]:\n\u001b[1;32m 2358\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 2359\u001b[0m \u001b[38;5;124;03m Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.\u001b[39;00m\n\u001b[1;32m 2360\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2371\u001b[0m \u001b[38;5;124;03m method).\u001b[39;00m\n\u001b[1;32m 2372\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m-> 2373\u001b[0m encoded_inputs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode_plus\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2374\u001b[0m \u001b[43m \u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2375\u001b[0m \u001b[43m \u001b[49m\u001b[43mtext_pair\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtext_pair\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2376\u001b[0m \u001b[43m \u001b[49m\u001b[43madd_special_tokens\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43madd_special_tokens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2377\u001b[0m \u001b[43m \u001b[49m\u001b[43mpadding\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpadding\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2378\u001b[0m \u001b[43m \u001b[49m\u001b[43mtruncation\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtruncation\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2379\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_length\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2380\u001b[0m \u001b[43m \u001b[49m\u001b[43mstride\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstride\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2381\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_tensors\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_tensors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2382\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2383\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2385\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m encoded_inputs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124minput_ids\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
1334
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2781\u001b[0m, in \u001b[0;36mPreTrainedTokenizerBase.encode_plus\u001b[0;34m(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)\u001b[0m\n\u001b[1;32m 2771\u001b[0m \u001b[38;5;66;03m# Backward compatibility for 'truncation_strategy', 'pad_to_max_length'\u001b[39;00m\n\u001b[1;32m 2772\u001b[0m padding_strategy, truncation_strategy, max_length, kwargs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_padding_truncation_strategies(\n\u001b[1;32m 2773\u001b[0m padding\u001b[38;5;241m=\u001b[39mpadding,\n\u001b[1;32m 2774\u001b[0m truncation\u001b[38;5;241m=\u001b[39mtruncation,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2778\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[1;32m 2779\u001b[0m )\n\u001b[0;32m-> 2781\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_encode_plus\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 2782\u001b[0m \u001b[43m \u001b[49m\u001b[43mtext\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2783\u001b[0m \u001b[43m \u001b[49m\u001b[43mtext_pair\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtext_pair\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2784\u001b[0m \u001b[43m \u001b[49m\u001b[43madd_special_tokens\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43madd_special_tokens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2785\u001b[0m \u001b[43m \u001b[49m\u001b[43mpadding_strategy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpadding_strategy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2786\u001b[0m \u001b[43m \u001b[49m\u001b[43mtruncation_strategy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtruncation_strategy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2787\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_length\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmax_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2788\u001b[0m \u001b[43m \u001b[49m\u001b[43mstride\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstride\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2789\u001b[0m \u001b[43m \u001b[49m\u001b[43mis_split_into_words\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mis_split_into_words\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2790\u001b[0m \u001b[43m \u001b[49m\u001b[43mpad_to_multiple_of\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpad_to_multiple_of\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2791\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_tensors\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_tensors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2792\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_token_type_ids\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_token_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2793\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_attention_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_attention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2794\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mreturn_overflowing_tokens\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_overflowing_tokens\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2795\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_special_tokens_mask\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_special_tokens_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2796\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_offsets_mapping\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_offsets_mapping\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2797\u001b[0m \u001b[43m \u001b[49m\u001b[43mreturn_length\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreturn_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2798\u001b[0m \u001b[43m \u001b[49m\u001b[43mverbose\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mverbose\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2799\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 2800\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
1335
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/transformers/tokenization_utils.py:656\u001b[0m, in \u001b[0;36mPreTrainedTokenizer._encode_plus\u001b[0;34m(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)\u001b[0m\n\u001b[1;32m 647\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m return_offsets_mapping:\n\u001b[1;32m 648\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mNotImplementedError\u001b[39;00m(\n\u001b[1;32m 649\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mreturn_offset_mapping is not available when using Python tokenizers. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 650\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTo use this feature, change your tokenizer to one deriving from \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 653\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhttps://github.com/huggingface/transformers/pull/2674\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 654\u001b[0m )\n\u001b[0;32m--> 656\u001b[0m first_ids \u001b[38;5;241m=\u001b[39m \u001b[43mget_input_ids\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 657\u001b[0m second_ids \u001b[38;5;241m=\u001b[39m get_input_ids(text_pair) \u001b[38;5;28;01mif\u001b[39;00m text_pair \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 659\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mprepare_for_model(\n\u001b[1;32m 660\u001b[0m first_ids,\n\u001b[1;32m 661\u001b[0m pair_ids\u001b[38;5;241m=\u001b[39msecond_ids,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 675\u001b[0m verbose\u001b[38;5;241m=\u001b[39mverbose,\n\u001b[1;32m 676\u001b[0m )\n",
1336
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/transformers/tokenization_utils.py:623\u001b[0m, in \u001b[0;36mPreTrainedTokenizer._encode_plus.<locals>.get_input_ids\u001b[0;34m(text)\u001b[0m\n\u001b[1;32m 621\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mget_input_ids\u001b[39m(text):\n\u001b[1;32m 622\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(text, \u001b[38;5;28mstr\u001b[39m):\n\u001b[0;32m--> 623\u001b[0m tokens \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtokenize\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 624\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mconvert_tokens_to_ids(tokens)\n\u001b[1;32m 625\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(text, (\u001b[38;5;28mlist\u001b[39m, \u001b[38;5;28mtuple\u001b[39m)) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(text) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m0\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(text[\u001b[38;5;241m0\u001b[39m], \u001b[38;5;28mstr\u001b[39m):\n",
1337
+ "File \u001b[0;32m~/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B/4792686b9af1b3663a02f39bc44f37326f4f30f4/tokenization_qwen.py:182\u001b[0m, in \u001b[0;36mQWenTokenizer.tokenize\u001b[0;34m(self, text, allowed_special, disallowed_special, **kwargs)\u001b[0m\n\u001b[1;32m 179\u001b[0m text \u001b[38;5;241m=\u001b[39m unicodedata\u001b[38;5;241m.\u001b[39mnormalize(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNFC\u001b[39m\u001b[38;5;124m\"\u001b[39m, text)\n\u001b[1;32m 181\u001b[0m \u001b[38;5;66;03m# this implementation takes a detour: text -> token id -> token surface forms\u001b[39;00m\n\u001b[0;32m--> 182\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m t \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 183\u001b[0m \u001b[43m \u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mallowed_special\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mallowed_special\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdisallowed_special\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdisallowed_special\u001b[49m\n\u001b[1;32m 184\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m:\n\u001b[1;32m 185\u001b[0m tokens\u001b[38;5;241m.\u001b[39mappend(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdecoder[t])\n\u001b[1;32m 186\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m tokens\n",
1338
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/tiktoken/core.py:117\u001b[0m, in \u001b[0;36mEncoding.encode\u001b[0;34m(self, text, allowed_special, disallowed_special)\u001b[0m\n\u001b[1;32m 115\u001b[0m disallowed_special \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mfrozenset\u001b[39m(disallowed_special)\n\u001b[1;32m 116\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m match \u001b[38;5;241m:=\u001b[39m _special_token_regex(disallowed_special)\u001b[38;5;241m.\u001b[39msearch(text):\n\u001b[0;32m--> 117\u001b[0m \u001b[43mraise_disallowed_special_token\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmatch\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 119\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 120\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_core_bpe\u001b[38;5;241m.\u001b[39mencode(text, allowed_special)\n",
1339
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/tiktoken/core.py:351\u001b[0m, in \u001b[0;36mraise_disallowed_special_token\u001b[0;34m(token)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mraise_disallowed_special_token\u001b[39m(token: \u001b[38;5;28mstr\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m NoReturn:\n\u001b[0;32m--> 351\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 352\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mEncountered text corresponding to disallowed special token \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mtoken\u001b[38;5;132;01m!r}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 353\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIf you want this text to be encoded as a special token, \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 354\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpass it to `allowed_special`, e.g. `allowed_special=\u001b[39m\u001b[38;5;130;01m{{\u001b[39;00m\u001b[38;5;132;01m{\u001b[39;00mtoken\u001b[38;5;132;01m!r}\u001b[39;00m\u001b[38;5;124m, ...\u001b[39m\u001b[38;5;130;01m}}\u001b[39;00m\u001b[38;5;124m`.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 355\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIf you want this text to be encoded as normal text, disable the check for this token \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 356\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mby passing `disallowed_special=(enc.special_tokens_set - \u001b[39m\u001b[38;5;130;01m{{\u001b[39;00m\u001b[38;5;132;01m{\u001b[39;00mtoken\u001b[38;5;132;01m!r}\u001b[39;00m\u001b[38;5;130;01m}}\u001b[39;00m\u001b[38;5;124m)`.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 357\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTo disable this check for all special tokens, pass `disallowed_special=()`.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 358\u001b[0m )\n",
1340
+ "\u001b[0;31mValueError\u001b[0m: Encountered text corresponding to disallowed special token '<|endoftext|>'.\nIf you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.\nIf you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.\nTo disable this check for all special tokens, pass `disallowed_special=()`.\n"
1341
+ ]
1342
+ }
1343
+ ],
1344
+ "source": [
1345
+ "tokenizer.encode(\"print('<|endoftext|>')\", allowed_special=set(), disallowed_special='all') + [tokenizer.eod_id]"
1346
+ ]
1347
+ },
1348
+ {
1349
+ "cell_type": "code",
1350
+ "execution_count": 52,
1351
+ "id": "d730bd02-d07f-42e8-8bf1-d311e6337061",
1352
+ "metadata": {},
1353
+ "outputs": [
1354
+ {
1355
+ "name": "stderr",
1356
+ "output_type": "stream",
1357
+ "text": [
1358
+ "Using unk_token, but it is not set yet.\n"
1359
+ ]
1360
+ }
1361
+ ],
1362
+ "source": [
1363
+ "tokenizer.unk_token"
1364
+ ]
1365
+ },
1366
+ {
1367
+ "cell_type": "code",
1368
+ "execution_count": 58,
1369
+ "id": "e1d658dd-03d4-4bb2-8947-e6d84f654d6f",
1370
+ "metadata": {},
1371
+ "outputs": [
1372
+ {
1373
+ "data": {
1374
+ "text/plain": [
1375
+ "NoneType"
1376
+ ]
1377
+ },
1378
+ "execution_count": 58,
1379
+ "metadata": {},
1380
+ "output_type": "execute_result"
1381
+ }
1382
+ ],
1383
+ "source": [
1384
+ "type(tokenizer.pad_token_id)\n",
1385
+ "#tokenizer._convert_id_to_token(tokenizer.pad_token_id)"
1386
+ ]
1387
+ },
1388
+ {
1389
+ "cell_type": "code",
1390
+ "execution_count": 54,
1391
+ "id": "139d66d9-017a-40d4-8b2d-919a78439ddf",
1392
+ "metadata": {},
1393
+ "outputs": [
1394
+ {
1395
+ "data": {
1396
+ "text/plain": [
1397
+ "151646"
1398
+ ]
1399
+ },
1400
+ "execution_count": 54,
1401
+ "metadata": {},
1402
+ "output_type": "execute_result"
1403
+ }
1404
+ ],
1405
+ "source": [
1406
+ "tokenizer.special_tokens['<|extra_0|>']\n",
1407
+ "#tokenizer.special_tokens['<|extra_204|>']"
1408
+ ]
1409
+ },
1410
+ {
1411
+ "cell_type": "markdown",
1412
+ "id": "15efb8c0-073e-4f9a-974c-9709f705e5f3",
1413
+ "metadata": {},
1414
+ "source": [
1415
+ "#abc"
1416
+ ]
1417
+ },
1418
+ {
1419
+ "cell_type": "code",
1420
+ "execution_count": null,
1421
+ "id": "b714951c-745e-41dd-86d2-89e28862e21c",
1422
+ "metadata": {},
1423
+ "outputs": [],
1424
+ "source": [
1425
+ "ids = [1350, 11146, 91, 8691, 723, 427, 91, 79865, 151643]\n",
1426
+ "tokenizer.convert_ids_to_tokens(ids)"
1427
+ ]
1428
+ },
1429
+ {
1430
+ "cell_type": "code",
1431
+ "execution_count": 78,
1432
+ "id": "0c217322-b24b-4803-bb4d-fb9c675fef5f",
1433
+ "metadata": {},
1434
+ "outputs": [
1435
+ {
1436
+ "data": {
1437
+ "text/plain": [
1438
+ "[b' ']"
1439
+ ]
1440
+ },
1441
+ "execution_count": 78,
1442
+ "metadata": {},
1443
+ "output_type": "execute_result"
1444
+ }
1445
+ ],
1446
+ "source": [
1447
+ "ids = tokenizer.encode(\" \")\n",
1448
+ "tokenizer.convert_ids_to_tokens(ids)"
1449
+ ]
1450
+ },
1451
+ {
1452
+ "cell_type": "code",
1453
+ "execution_count": 101,
1454
+ "id": "1ac63768-bb7c-45de-ba95-ca918db5f806",
1455
+ "metadata": {},
1456
+ "outputs": [
1457
+ {
1458
+ "name": "stdout",
1459
+ "output_type": "stream",
1460
+ "text": [
1461
+ "ids: [151644, 1350, 492, 35946, 99639, 91680, 100472, 151646, 1305, 2, 116198, 116198, 116198, 13, 10236, 226, 114, 151645, 151643]\n"
1462
+ ]
1463
+ }
1464
+ ],
1465
+ "source": [
1466
+ "ids = tokenizer.encode(\"<|im_start|>print('我是一只猫<|extra_0|>')\\n#喵喵喵. 然<|im_end|>\", \n",
1467
+ " allowed_special={'<|im_start|>', '<|im_end|>', '<|extra_0|>'}, \n",
1468
+ " disallowed_special={'<|endoftext|>'}) + [tokenizer.eod_id]\n",
1469
+ "print(\"ids:\", ids)"
1470
+ ]
1471
+ },
1472
+ {
1473
+ "cell_type": "code",
1474
+ "execution_count": 102,
1475
+ "id": "47530127-7f66-4b2c-9396-10b672ef9c4c",
1476
+ "metadata": {},
1477
+ "outputs": [
1478
+ {
1479
+ "data": {
1480
+ "text/plain": [
1481
+ "['<|im_start|>',\n",
1482
+ " b'print',\n",
1483
+ " b\"('\",\n",
1484
+ " b'\\xe6\\x88\\x91',\n",
1485
+ " b'\\xe6\\x98\\xaf\\xe4\\xb8\\x80',\n",
1486
+ " b'\\xe5\\x8f\\xaa',\n",
1487
+ " b'\\xe7\\x8c\\xab',\n",
1488
+ " '<|extra_0|>',\n",
1489
+ " b\"')\\n\",\n",
1490
+ " b'#',\n",
1491
+ " b'\\xe5\\x96\\xb5',\n",
1492
+ " b'\\xe5\\x96\\xb5',\n",
1493
+ " b'\\xe5\\x96\\xb5',\n",
1494
+ " b'.',\n",
1495
+ " b' \\xe7',\n",
1496
+ " b'\\x84',\n",
1497
+ " b'\\xb6',\n",
1498
+ " '<|im_end|>',\n",
1499
+ " '<|endoftext|>']"
1500
+ ]
1501
+ },
1502
+ "execution_count": 102,
1503
+ "metadata": {},
1504
+ "output_type": "execute_result"
1505
+ }
1506
+ ],
1507
+ "source": [
1508
+ "tokenizer.convert_ids_to_tokens(ids)\n",
1509
+ "##bytes to string\n",
1510
+ "\n",
1511
+ "# for t in tokens:\n",
1512
+ "# if isinstance(t, bytes):\n",
1513
+ "# try:\n",
1514
+ "# t = t.decode('utf-8')\n",
1515
+ "# except:\n",
1516
+ "# print(\"*\", t)\n",
1517
+ "# t = t.decode('iso-8859-1')\n",
1518
+ "# print(t)\n",
1519
+ " "
1520
+ ]
1521
+ },
1522
+ {
1523
+ "cell_type": "code",
1524
+ "execution_count": null,
1525
+ "id": "c9a655c8-cf35-4209-9815-dd077fdfe6d7",
1526
+ "metadata": {},
1527
+ "outputs": [],
1528
+ "source": [
1529
+ "tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(ids))"
1530
+ ]
1531
+ },
1532
+ {
1533
+ "cell_type": "code",
1534
+ "execution_count": 64,
1535
+ "id": "26664e1a-192f-499f-9411-d6639d302b73",
1536
+ "metadata": {},
1537
+ "outputs": [
1538
+ {
1539
+ "data": {
1540
+ "text/plain": [
1541
+ "'是一'"
1542
+ ]
1543
+ },
1544
+ "execution_count": 64,
1545
+ "metadata": {},
1546
+ "output_type": "execute_result"
1547
+ }
1548
+ ],
1549
+ "source": [
1550
+ "bbytes = b'\\xe6\\x98\\xaf\\xe4\\xb8\\x80'\n",
1551
+ "bbytes.decode('utf-8')"
1552
+ ]
1553
+ },
1554
+ {
1555
+ "cell_type": "code",
1556
+ "execution_count": null,
1557
+ "id": "0a50927d-0a26-4083-976a-3ec600405f7b",
1558
+ "metadata": {},
1559
+ "outputs": [],
1560
+ "source": []
1561
+ }
1562
+ ],
1563
+ "metadata": {
1564
+ "kernelspec": {
1565
+ "display_name": "Python 3 (ipykernel)",
1566
+ "language": "python",
1567
+ "name": "python3"
1568
+ },
1569
+ "language_info": {
1570
+ "codemirror_mode": {
1571
+ "name": "ipython",
1572
+ "version": 3
1573
+ },
1574
+ "file_extension": ".py",
1575
+ "mimetype": "text/x-python",
1576
+ "name": "python",
1577
+ "nbconvert_exporter": "python",
1578
+ "pygments_lexer": "ipython3",
1579
+ "version": "3.10.6"
1580
+ }
1581
+ },
1582
+ "nbformat": 4,
1583
+ "nbformat_minor": 5
1584
+ }
tiktoken_test.ipynb ADDED
The diff for this file is too large to render. See raw diff
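
For reference, a minimal standalone sketch of the special-token behaviour exercised in ali_token.ipynb above. It assumes the same environment as the notebook (transformers with trust_remote_code, tiktoken installed, access to the Qwen/Qwen-7B repo); the ids in the comments are those observed in the notebook's own run, not guaranteed for other tokenizer versions.

    from transformers import AutoTokenizer

    # Same tokenizer as in the notebook; trust_remote_code is required because
    # QWenTokenizer ships its own tiktoken-based implementation.
    tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)

    text = "print('<|endoftext|>')"

    # Default behaviour: the '<|endoftext|>' substring is mapped to its single
    # special-token id (151643, also available as tokenizer.eod_id).
    print(tokenizer.encode(text))
    # -> [1350, 492, 151643, 863] in the notebook run

    # Treat it as plain text instead: the marker is split into ordinary BPE pieces.
    print(tokenizer.encode(text, allowed_special=set(), disallowed_special=()))
    # -> [1350, 11146, 91, 8691, 723, 427, 91, 79865] in the notebook run

    # With disallowed_special='all', the same call raises ValueError,
    # as the error cell in the notebook shows.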