dotan1111 committed on
Commit
b516756
1 Parent(s): 95c6581

Upload 2 files

Files changed (2)
  1. README.md +52 -0
  2. tokenizer.json +1630 -0
README.md ADDED
@@ -0,0 +1,52 @@
1
+ ---
2
+ tags:
3
+ - biology
4
+ - bioinformatics
5
+ - tokenizers
6
+ ---
7
+ # Effect of Tokenization on Transformers for Biological Sequences
8
+ ## Abstract:
9
+ Deep learning models are transforming biological research. Many bioinformatics and comparative genomics algorithms analyze genomic data, either DNA or protein sequences. Examples include sequence alignment, phylogenetic tree inference and automatic classification of protein functions. Among deep learning algorithms, models developed in the natural language processing (NLP) community were recently applied to biological sequences. However, biological sequences differ from natural languages such as English and French, in which segmenting text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families. We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in a more than three-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.
10
+
11
+ ![image](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/d69893e2-7114-41a8-8d46-9b025b2d2840)
12
+
13
+ Different tokenization algorithms can be applied to biological sequences, as exemplified for the sequence “AAGTCAAGGATC”. (a) The baseline “words” tokenizer assumes a dictionary consisting of the nucleotides: “A”, “C”, “G” and “T”. The length of the encoded sequence is 12, i.e., the number of nucleotides; (b) The “pairs” tokenizer assumes a dictionary consisting of all possible nucleotide pairs. The length of the encoded sequences is typically halved; (c) A sophisticated dictionary consisting of only three tokens: “AAG”, “TC” and “GA”. The encoded sequence for this dictionary contains only five tokens.
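
The token counts in the caption above can be checked with a few lines of Python. This is a minimal sketch, assuming a greedy longest-match segmentation; the function name `greedy_tokenize` and the variable names are illustrative and are not taken from the paper or the repository, and trained BPE, WordPiece or Unigram tokenizers apply their own segmentation rules.

```python
from itertools import product


def greedy_tokenize(sequence, vocabulary):
    """Split `sequence` left to right, always taking the longest matching token."""
    max_len = max(len(token) for token in vocabulary)
    tokens, pos = [], 0
    while pos < len(sequence):
        # Try the longest candidate first, then shrink until one is in the vocabulary.
        for size in range(min(max_len, len(sequence) - pos), 0, -1):
            piece = sequence[pos:pos + size]
            if piece in vocabulary:
                tokens.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no token in the vocabulary covers position {pos}")
    return tokens


SEQ = "AAGTCAAGGATC"
words = set("ACGT")                                   # (a) one token per nucleotide
pairs = {a + b for a, b in product("ACGT", repeat=2)}  # (b) all 16 nucleotide pairs
learned = {"AAG", "TC", "GA"}                          # (c) the three-token dictionary

for name, vocab in [("words", words), ("pairs", pairs), ("learned", learned)]:
    tokens = greedy_tokenize(SEQ, vocab)
    print(name, len(tokens), tokens)
```

Note that the "pairs" vocabulary in this sketch contains no single nucleotides, so it only covers sequences of even length.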
14
+
15
+ ## Data:
16
+ The "data" folder contains the training, validation and test data for seven of the eight datasets used in the paper.
17
+
18
+ ## BFD Tokenizers:
19
+
20
+ We trained BPE, WordPiece and Unigram tokenizers on samples of proteins from the 2.2 billion protein sequences of the BFD dataset (Steinegger and Söding 2018). We evaluated the average sequence length, measured in tokens, as a function of the vocabulary size and the number of sequences in the training data.
21
+
22
+ ![BFD_BPE_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/710b7aa7-0dde-46bb-9ddf-39a84b579d71)
23
+ ![BFD_WPC_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/8adfe5a7-25f5-4723-a87a-8598c6a76ff6)
24
+ ![BFD_UNI_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/4462e782-0b21-4377-a5fe-309685141538)
25
+
26
+ Effect of vocabulary size and number of training samples on the three tokenizers: BPE, WordPiece and Unigram. The darker the color, the higher the average number of tokens per protein. Increasing the vocabulary size and the training set size reduces the number of tokens per protein for all of the tested tokenizers.
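
The relationship between vocabulary size and average sequence length can be illustrated in miniature. The sketch below is a toy byte-pair-encoding trainer written for illustration only; it is not the training code behind the tables, and the names `train_bpe`, `merge_pair`, `average_length` and the `TOY_PROTEINS` strings are made up for this example. Each merge adds one token to the vocabulary, so the number of merges stands in for vocabulary size.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace adjacent occurrences of `pair` in `tokens` with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Apply up to `num_merges` BPE merges; return the tokenized corpus and the merges."""
    corpus = [list(seq) for seq in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(tokens, tokens[1:]))  # count adjacent token pairs
        if not counts:
            break  # every sequence is already a single token
        # Most frequent pair; ties are broken by the pair itself for determinism.
        pair = max(counts, key=lambda p: (counts[p], p))
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return corpus, merges


def average_length(corpus):
    """Average number of tokens per sequence."""
    return sum(len(tokens) for tokens in corpus) / len(corpus)


# Hypothetical toy strings standing in for protein sequences.
TOY_PROTEINS = ["MKALLAAG", "MKALAAGG", "ALLAAGMK", "AAGGALLA"]

for num_merges in (0, 2, 4, 8):
    corpus, _ = train_bpe(TOY_PROTEINS, num_merges)
    print(num_merges, average_length(corpus))
```

On a corpus this small the numbers themselves are meaningless; the point is only that the average length cannot grow as more merges are learned.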
27
+
28
+ We uploaded the "BFD_Tokenizers", which were trained on 10,000,000 sequences randomly sampled from the BFD dataset.
29
+
30
+ ## GitHub
31
+
32
+ The code, datasets and trained tokenizers are available on https://github.com/idotan286/BiologicalTokenizers/.
33
+
34
+ ## APA
35
+
36
+ ```
37
+ Dotan, E., Jaschek, G., Pupko, T., & Belinkov, Y. (2023). Effect of Tokenization on Transformers for Biological Sequences. bioRxiv. https://doi.org/10.1101/2023.08.15.553415
38
+ ```
39
+
40
+
41
+ ## BibTeX
42
+ ```
43
+ @article{Dotan_Effect_of_Tokenization_2023,
44
+ author = {Dotan, Edo and Jaschek, Gal and Pupko, Tal and Belinkov, Yonatan},
45
+ doi = {10.1101/2023.08.15.553415},
46
+ journal = {bioRxiv},
47
+ month = aug,
48
+ title = {{Effect of Tokenization on Transformers for Biological Sequences}},
49
+ year = {2023}
50
+ }
51
+
52
+ ```
tokenizer.json ADDED
@@ -0,0 +1,1630 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "<UNK>",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ }
15
+ ],
16
+ "normalizer": {
17
+ "type": "Lowercase"
18
+ },
19
+ "pre_tokenizer": {
20
+ "type": "Whitespace"
21
+ },
22
+ "post_processor": null,
23
+ "decoder": null,
24
+ "model": {
25
+ "type": "Unigram",
26
+ "unk_id": 0,
27
+ "vocab": [
28
+ [
29
+ "<UNK>",
30
+ 0.0
31
+ ],
32
+ [
33
+ "a",
34
+ -3.632944090450515
35
+ ],
36
+ [
37
+ "l",
38
+ -3.7068570202096325
39
+ ],
40
+ [
41
+ "g",
42
+ -3.9042950954442333
43
+ ],
44
+ [
45
+ "r",
46
+ -3.9545991750240184
47
+ ],
48
+ [
49
+ "v",
50
+ -3.998234837926354
51
+ ],
52
+ [
53
+ "s",
54
+ -4.14952275224373
55
+ ],
56
+ [
57
+ "t",
58
+ -4.287456837510817
59
+ ],
60
+ [
61
+ "e",
62
+ -4.289392028132756
63
+ ],
64
+ [
65
+ "p",
66
+ -4.290234200036018
67
+ ],
68
+ [
69
+ "d",
70
+ -4.3477089813135485
71
+ ],
72
+ [
73
+ "i",
74
+ -4.476037236923364
75
+ ],
76
+ [
77
+ "m",
78
+ -4.510006645253128
79
+ ],
80
+ [
81
+ "k",
82
+ -4.708311207049434
83
+ ],
84
+ [
85
+ "q",
86
+ -4.735006170352175
87
+ ],
88
+ [
89
+ "f",
90
+ -4.789893467104555
91
+ ],
92
+ [
93
+ "h",
94
+ -5.0101739158699665
95
+ ],
96
+ [
97
+ "n",
98
+ -5.01778111762699
99
+ ],
100
+ [
101
+ "aa",
102
+ -5.022472558408127
103
+ ],
104
+ [
105
+ "rr",
106
+ -5.128207662147673
107
+ ],
108
+ [
109
+ "c",
110
+ -5.145198760398738
111
+ ],
112
+ [
113
+ "al",
114
+ -5.180538576122947
115
+ ],
116
+ [
117
+ "ll",
118
+ -5.201140477172862
119
+ ],
120
+ [
121
+ "y",
122
+ -5.2091061310437485
123
+ ],
124
+ [
125
+ "av",
126
+ -5.23780321892715
127
+ ],
128
+ [
129
+ "ag",
130
+ -5.2607155787369955
131
+ ],
132
+ [
133
+ "rl",
134
+ -5.269063013646287
135
+ ],
136
+ [
137
+ "lv",
138
+ -5.280324374261429
139
+ ],
140
+ [
141
+ "vl",
142
+ -5.2821850373789125
143
+ ],
144
+ [
145
+ "la",
146
+ -5.287911107396141
147
+ ],
148
+ [
149
+ "gg",
150
+ -5.2907629902747395
151
+ ],
152
+ [
153
+ "lr",
154
+ -5.296613327734477
155
+ ],
156
+ [
157
+ "ar",
158
+ -5.317839521134866
159
+ ],
160
+ [
161
+ "w",
162
+ -5.3283983119940075
163
+ ],
164
+ [
165
+ "ga",
166
+ -5.342320152037463
167
+ ],
168
+ [
169
+ "lg",
170
+ -5.354316588369922
171
+ ],
172
+ [
173
+ "gr",
174
+ -5.36421754910026
175
+ ],
176
+ [
177
+ "ls",
178
+ -5.365998834295857
179
+ ],
180
+ [
181
+ "ra",
182
+ -5.368576042445021
183
+ ],
184
+ [
185
+ "el",
186
+ -5.3742851295036225
187
+ ],
188
+ [
189
+ "sa",
190
+ -5.3767074957757295
191
+ ],
192
+ [
193
+ "pa",
194
+ -5.38691039440727
195
+ ],
196
+ [
197
+ "va",
198
+ -5.391132319944489
199
+ ],
200
+ [
201
+ "sl",
202
+ -5.391970768048152
203
+ ],
204
+ [
205
+ "ta",
206
+ -5.395723638820606
207
+ ],
208
+ [
209
+ "gl",
210
+ -5.397241434678124
211
+ ],
212
+ [
213
+ "vv",
214
+ -5.402501062230561
215
+ ],
216
+ [
217
+ "as",
218
+ -5.4054133935553565
219
+ ],
220
+ [
221
+ "dl",
222
+ -5.408363687465943
223
+ ],
224
+ [
225
+ "da",
226
+ -5.419504788703245
227
+ ],
228
+ [
229
+ "lp",
230
+ -5.421171023058353
231
+ ],
232
+ [
233
+ "sg",
234
+ -5.425624703056531
235
+ ],
236
+ [
237
+ "vg",
238
+ -5.43958578125789
239
+ ],
240
+ [
241
+ "pg",
242
+ -5.443116085521833
243
+ ],
244
+ [
245
+ "ld",
246
+ -5.460648843958097
247
+ ],
248
+ [
249
+ "gv",
250
+ -5.461726128843196
251
+ ],
252
+ [
253
+ "at",
254
+ -5.4623194350596
255
+ ],
256
+ [
257
+ "tl",
258
+ -5.476791998017722
259
+ ],
260
+ [
261
+ "ae",
262
+ -5.485646676180689
263
+ ],
264
+ [
265
+ "ss",
266
+ -5.494587867484871
267
+ ],
268
+ [
269
+ "ad",
270
+ -5.503243044744844
271
+ ],
272
+ [
273
+ "le",
274
+ -5.513539597107796
275
+ ],
276
+ [
277
+ "er",
278
+ -5.525114213305434
279
+ ],
280
+ [
281
+ "dg",
282
+ -5.52631343875059
283
+ ],
284
+ [
285
+ "lt",
286
+ -5.535693817027219
287
+ ],
288
+ [
289
+ "ea",
290
+ -5.535755605860464
291
+ ],
292
+ [
293
+ "ia",
294
+ -5.536650327927395
295
+ ],
296
+ [
297
+ "tg",
298
+ -5.545634533570794
299
+ ],
300
+ [
301
+ "rg",
302
+ -5.547461100096692
303
+ ],
304
+ [
305
+ "vr",
306
+ -5.5482882187858085
307
+ ],
308
+ [
309
+ "rv",
310
+ -5.5960473758838525
311
+ ],
312
+ [
313
+ "x",
314
+ -5.608946077998157
315
+ ],
316
+ [
317
+ "rs",
318
+ -5.611870379615395
319
+ ],
320
+ [
321
+ "gs",
322
+ -5.626777482170128
323
+ ],
324
+ [
325
+ "ap",
326
+ -5.631965142077629
327
+ ],
328
+ [
329
+ "ai",
330
+ -5.636142688236619
331
+ ],
332
+ [
333
+ "ve",
334
+ -5.639894023082112
335
+ ],
336
+ [
337
+ "ge",
338
+ -5.655452846064064
339
+ ],
340
+ [
341
+ "pl",
342
+ -5.660462720181497
343
+ ],
344
+ [
345
+ "gd",
346
+ -5.664400349464522
347
+ ],
348
+ [
349
+ "ee",
350
+ -5.671898580128399
351
+ ],
352
+ [
353
+ "vd",
354
+ -5.672600958807292
355
+ ],
356
+ [
357
+ "dv",
358
+ -5.674250319761869
359
+ ],
360
+ [
361
+ "tv",
362
+ -5.675826469215492
363
+ ],
364
+ [
365
+ "gt",
366
+ -5.682517311383533
367
+ ],
368
+ [
369
+ "rp",
370
+ -5.687456812768204
371
+ ],
372
+ [
373
+ "vt",
374
+ -5.69613081375628
375
+ ],
376
+ [
377
+ "pv",
378
+ -5.701851465097279
379
+ ],
380
+ [
381
+ "vs",
382
+ -5.7061688041113765
383
+ ],
384
+ [
385
+ "dr",
386
+ -5.729755803330297
387
+ ],
388
+ [
389
+ "ig",
390
+ -5.740135021060222
391
+ ],
392
+ [
393
+ "sr",
394
+ -5.740787547090417
395
+ ],
396
+ [
397
+ "re",
398
+ -5.753341772270069
399
+ ],
400
+ [
401
+ "sv",
402
+ -5.773162831092586
403
+ ],
404
+ [
405
+ "ev",
406
+ -5.773357017945507
407
+ ],
408
+ [
409
+ "gi",
410
+ -5.779110480802128
411
+ ],
412
+ [
413
+ "sp",
414
+ -5.78792820379083
415
+ ],
416
+ [
417
+ "tt",
418
+ -5.788142314600666
419
+ ],
420
+ [
421
+ "li",
422
+ -5.7894561148560495
423
+ ],
424
+ [
425
+ "tp",
426
+ -5.811521012006784
427
+ ],
428
+ [
429
+ "de",
430
+ -5.81161181019956
431
+ ],
432
+ [
433
+ "pp",
434
+ -5.817117213994678
435
+ ],
436
+ [
437
+ "aq",
438
+ -5.8208411958823
439
+ ],
440
+ [
441
+ "vp",
442
+ -5.82086135309334
443
+ ],
444
+ [
445
+ "pr",
446
+ -5.8300324993501444
447
+ ],
448
+ [
449
+ "rd",
450
+ -5.832959938409839
451
+ ],
452
+ [
453
+ "ps",
454
+ -5.833765158121672
455
+ ],
456
+ [
457
+ "iv",
458
+ -5.834924940666063
459
+ ],
460
+ [
461
+ "st",
462
+ -5.836422116950422
463
+ ],
464
+ [
465
+ "ts",
466
+ -5.845334860786274
467
+ ],
468
+ [
469
+ "qa",
470
+ -5.8602742392000575
471
+ ],
472
+ [
473
+ "dp",
474
+ -5.889141466606141
475
+ ],
476
+ [
477
+ "lf",
478
+ -5.891988029154746
479
+ ],
480
+ [
481
+ "ei",
482
+ -5.8942462729591
483
+ ],
484
+ [
485
+ "ql",
486
+ -5.896280144036849
487
+ ],
488
+ [
489
+ "pe",
490
+ -5.8989822786508554
491
+ ],
492
+ [
493
+ "dd",
494
+ -5.899947270574684
495
+ ],
496
+ [
497
+ "fa",
498
+ -5.908091119842952
499
+ ],
500
+ [
501
+ "il",
502
+ -5.909313751415439
503
+ ],
504
+ [
505
+ "pd",
506
+ -5.916882430976356
507
+ ],
508
+ [
509
+ "lk",
510
+ -5.926334580525451
511
+ ],
512
+ [
513
+ "kk",
514
+ -5.929554253156709
515
+ ],
516
+ [
517
+ "af",
518
+ -5.9304097490202015
519
+ ],
520
+ [
521
+ "fg",
522
+ -5.934233305800573
523
+ ],
524
+ [
525
+ "eg",
526
+ -5.935506392701038
527
+ ],
528
+ [
529
+ "gk",
530
+ -5.943123839548651
531
+ ],
532
+ [
533
+ "ak",
534
+ -5.947812328653333
535
+ ],
536
+ [
537
+ "fl",
538
+ -5.948919578195277
539
+ ],
540
+ [
541
+ "id",
542
+ -5.95000787138518
543
+ ],
544
+ [
545
+ "ri",
546
+ -5.953187755007269
547
+ ],
548
+ [
549
+ "kl",
550
+ -5.954798705995039
551
+ ],
552
+ [
553
+ "vi",
554
+ -5.9632196203527315
555
+ ],
556
+ [
557
+ "lq",
558
+ -5.981225013933381
559
+ ],
560
+ [
561
+ "ie",
562
+ -5.987842234328983
563
+ ],
564
+ [
565
+ "gp",
566
+ -5.996294580750687
567
+ ],
568
+ [
569
+ "ek",
570
+ -6.006345600102266
571
+ ],
572
+ [
573
+ "rt",
574
+ -6.0121209086764775
575
+ ],
576
+ [
577
+ "ka",
578
+ -6.01491898744084
579
+ ],
580
+ [
581
+ "gf",
582
+ -6.015842456861838
583
+ ],
584
+ [
585
+ "qr",
586
+ -6.016464987489098
587
+ ],
588
+ [
589
+ "is",
590
+ -6.045176758128992
591
+ ],
592
+ [
593
+ "nl",
594
+ -6.085959225557872
595
+ ],
596
+ [
597
+ "pt",
598
+ -6.0880399997621435
599
+ ],
600
+ [
601
+ "si",
602
+ -6.0889675409573325
603
+ ],
604
+ [
605
+ "ti",
606
+ -6.089641985742588
607
+ ],
608
+ [
609
+ "rq",
610
+ -6.093718164152271
611
+ ],
612
+ [
613
+ "tr",
614
+ -6.102011832883294
615
+ ],
616
+ [
617
+ "sd",
618
+ -6.106249171332308
619
+ ],
620
+ [
621
+ "gq",
622
+ -6.1155110312256795
623
+ ],
624
+ [
625
+ "eq",
626
+ -6.147800209212976
627
+ ],
628
+ [
629
+ "ln",
630
+ -6.173015268781688
631
+ ],
632
+ [
633
+ "ng",
634
+ -6.181732269229789
635
+ ],
636
+ [
637
+ "se",
638
+ -6.198317089876394
639
+ ],
640
+ [
641
+ "sf",
642
+ -6.200053977367045
643
+ ],
644
+ [
645
+ "na",
646
+ -6.2011622959247745
647
+ ],
648
+ [
649
+ "fv",
650
+ -6.21270323523432
651
+ ],
652
+ [
653
+ "et",
654
+ -6.213211850957885
655
+ ],
656
+ [
657
+ "ed",
658
+ -6.220732944550861
659
+ ],
660
+ [
661
+ "vf",
662
+ -6.235612361132397
663
+ ],
664
+ [
665
+ "it",
666
+ -6.236237188173279
667
+ ],
668
+ [
669
+ "hl",
670
+ -6.242607870761024
671
+ ],
672
+ [
673
+ "rf",
674
+ -6.242719164870595
675
+ ],
676
+ [
677
+ "ke",
678
+ -6.243551778618629
679
+ ],
680
+ [
681
+ "fs",
682
+ -6.247993226234106
683
+ ],
684
+ [
685
+ "an",
686
+ -6.26119987064193
687
+ ],
688
+ [
689
+ "ma",
690
+ -6.264278246418684
691
+ ],
692
+ [
693
+ "ep",
694
+ -6.265933814263626
695
+ ],
696
+ [
697
+ "gn",
698
+ -6.2690329158590625
699
+ ],
700
+ [
701
+ "yl",
702
+ -6.273341632364284
703
+ ],
704
+ [
705
+ "fd",
706
+ -6.276005658681607
707
+ ],
708
+ [
709
+ "td",
710
+ -6.280432226339416
711
+ ],
712
+ [
713
+ "qv",
714
+ -6.2888497616309
715
+ ],
716
+ [
717
+ "ha",
718
+ -6.296053201668862
719
+ ],
720
+ [
721
+ "lh",
722
+ -6.30246972284626
723
+ ],
724
+ [
725
+ "ki",
726
+ -6.3175004329885045
727
+ ],
728
+ [
729
+ "ml",
730
+ -6.328114226448536
731
+ ],
732
+ [
733
+ "hg",
734
+ -6.33088519345314
735
+ ],
736
+ [
737
+ "gy",
738
+ -6.333659448005497
739
+ ],
740
+ [
741
+ "es",
742
+ -6.334612606722317
743
+ ],
744
+ [
745
+ "rk",
746
+ -6.341736778775461
747
+ ],
748
+ [
749
+ "hr",
750
+ -6.354408701195906
751
+ ],
752
+ [
753
+ "kv",
754
+ -6.3558736445353805
755
+ ],
756
+ [
757
+ "di",
758
+ -6.358575255803077
759
+ ],
760
+ [
761
+ "kt",
762
+ -6.367157900996915
763
+ ],
764
+ [
765
+ "ah",
766
+ -6.369617063929681
767
+ ],
768
+ [
769
+ "ks",
770
+ -6.376100446248731
771
+ ],
772
+ [
773
+ "qq",
774
+ -6.37798503837473
775
+ ],
776
+ [
777
+ "ir",
778
+ -6.3857457573601675
779
+ ],
780
+ [
781
+ "rh",
782
+ -6.387183443208819
783
+ ],
784
+ [
785
+ "kr",
786
+ -6.389145725185193
787
+ ],
788
+ [
789
+ "ff",
790
+ -6.394310366090611
791
+ ],
792
+ [
793
+ "np",
794
+ -6.394909773835787
795
+ ],
796
+ [
797
+ "qp",
798
+ -6.400762509720662
799
+ ],
800
+ [
801
+ "vk",
802
+ -6.408006270334974
803
+ ],
804
+ [
805
+ "ip",
806
+ -6.426055131001776
807
+ ],
808
+ [
809
+ "vq",
810
+ -6.436536338470235
811
+ ],
812
+ [
813
+ "ya",
814
+ -6.436780445351445
815
+ ],
816
+ [
817
+ "yg",
818
+ -6.437618271718964
819
+ ],
820
+ [
821
+ "vn",
822
+ -6.439633057236559
823
+ ],
824
+ [
825
+ "te",
826
+ -6.4462825523394685
827
+ ],
828
+ [
829
+ "nv",
830
+ -6.447181836880107
831
+ ],
832
+ [
833
+ "qg",
834
+ -6.450639119164498
835
+ ],
836
+ [
837
+ "ay",
838
+ -6.459484467855098
839
+ ],
840
+ [
841
+ "df",
842
+ -6.464086065081403
843
+ ],
844
+ [
845
+ "gh",
846
+ -6.466932879944524
847
+ ],
848
+ [
849
+ "tf",
850
+ -6.469809893138768
851
+ ],
852
+ [
853
+ "sk",
854
+ -6.479002868749177
855
+ ],
856
+ [
857
+ "ds",
858
+ -6.485484141843836
859
+ ],
860
+ [
861
+ "sn",
862
+ -6.4878031190687935
863
+ ],
864
+ [
865
+ "ii",
866
+ -6.504258965341522
867
+ ],
868
+ [
869
+ "kg",
870
+ -6.507398478812792
871
+ ],
872
+ [
873
+ "kp",
874
+ -6.518085192976523
875
+ ],
876
+ [
877
+ "ly",
878
+ -6.522152923737886
879
+ ],
880
+ [
881
+ "aaa",
882
+ -6.526671396118413
883
+ ],
884
+ [
885
+ "ft",
886
+ -6.5295453018170875
887
+ ],
888
+ [
889
+ "hp",
890
+ -6.53078009472407
891
+ ],
892
+ [
893
+ "kd",
894
+ -6.547001972884521
895
+ ],
896
+ [
897
+ "qi",
898
+ -6.556091048164445
899
+ ],
900
+ [
901
+ "pi",
902
+ -6.557328188755392
903
+ ],
904
+ [
905
+ "qs",
906
+ -6.55968638147019
907
+ ],
908
+ [
909
+ "dt",
910
+ -6.579466923820265
911
+ ],
912
+ [
913
+ "ns",
914
+ -6.581085923267036
915
+ ],
916
+ [
917
+ "sq",
918
+ -6.583571636718558
919
+ ],
920
+ [
921
+ "kn",
922
+ -6.601121373063897
923
+ ],
924
+ [
925
+ "en",
926
+ -6.606175960425073
927
+ ],
928
+ [
929
+ "fe",
930
+ -6.60685998733843
931
+ ],
932
+ [
933
+ "tn",
934
+ -6.615442326280007
935
+ ],
936
+ [
937
+ "wl",
938
+ -6.618302734526351
939
+ ],
940
+ [
941
+ "pq",
942
+ -6.621155961329553
943
+ ],
944
+ [
945
+ "ni",
946
+ -6.624879415351742
947
+ ],
948
+ [
949
+ "yr",
950
+ -6.625910436920403
951
+ ],
952
+ [
953
+ "qt",
954
+ -6.6307505077769555
955
+ ],
956
+ [
957
+ "pf",
958
+ -6.6308412369595455
959
+ ],
960
+ [
961
+ "rn",
962
+ -6.631740741295436
963
+ ],
964
+ [
965
+ "in",
966
+ -6.639584636482613
967
+ ],
968
+ [
969
+ "hv",
970
+ -6.640247883065955
971
+ ],
972
+ [
973
+ "ik",
974
+ -6.650348904814281
975
+ ],
976
+ [
977
+ "ry",
978
+ -6.655536274374892
979
+ ],
980
+ [
981
+ "dq",
982
+ -6.662650543901028
983
+ ],
984
+ [
985
+ "fr",
986
+ -6.677637695424519
987
+ ],
988
+ [
989
+ "if",
990
+ -6.681523029800585
991
+ ],
992
+ [
993
+ "mr",
994
+ -6.701260323121653
995
+ ],
996
+ [
997
+ "ef",
998
+ -6.721187968715531
999
+ ],
1000
+ [
1001
+ "am",
1002
+ -6.722392548689049
1003
+ ],
1004
+ [
1005
+ "ms",
1006
+ -6.722957219864604
1007
+ ],
1008
+ [
1009
+ "qe",
1010
+ -6.723551837092796
1011
+ ],
1012
+ [
1013
+ "sy",
1014
+ -6.728733788049803
1015
+ ],
1016
+ [
1017
+ "mt",
1018
+ -6.739373333977653
1019
+ ],
1020
+ [
1021
+ "yv",
1022
+ -6.742995012080062
1023
+ ],
1024
+ [
1025
+ "eh",
1026
+ -6.74876854304447
1027
+ ],
1028
+ [
1029
+ "vh",
1030
+ -6.750406955107264
1031
+ ],
1032
+ [
1033
+ "nn",
1034
+ -6.7566005866277745
1035
+ ],
1036
+ [
1037
+ "yd",
1038
+ -6.760908295045155
1039
+ ],
1040
+ [
1041
+ "tk",
1042
+ -6.762557739296788
1043
+ ],
1044
+ [
1045
+ "nr",
1046
+ -6.7630435456712235
1047
+ ],
1048
+ [
1049
+ "nt",
1050
+ -6.767721256103645
1051
+ ],
1052
+ [
1053
+ "mv",
1054
+ -6.769331033226527
1055
+ ],
1056
+ [
1057
+ "dk",
1058
+ -6.781027506796054
1059
+ ],
1060
+ [
1061
+ "nd",
1062
+ -6.7895149322464246
1063
+ ],
1064
+ [
1065
+ "mg",
1066
+ -6.7954010994947005
1067
+ ],
1068
+ [
1069
+ "kq",
1070
+ -6.796318304309551
1071
+ ],
1072
+ [
1073
+ "qk",
1074
+ -6.80278943044512
1075
+ ],
1076
+ [
1077
+ "gw",
1078
+ -6.810481762215055
1079
+ ],
1080
+ [
1081
+ "dy",
1082
+ -6.813654384938911
1083
+ ],
1084
+ [
1085
+ "ys",
1086
+ -6.822655724067237
1087
+ ],
1088
+ [
1089
+ "vy",
1090
+ -6.82632505735617
1091
+ ],
1092
+ [
1093
+ "wr",
1094
+ -6.829081855085716
1095
+ ],
1096
+ [
1097
+ "mp",
1098
+ -6.839761205069106
1099
+ ],
1100
+ [
1101
+ "fi",
1102
+ -6.85048550044306
1103
+ ],
1104
+ [
1105
+ "aw",
1106
+ -6.863259418010875
1107
+ ],
1108
+ [
1109
+ "pk",
1110
+ -6.865582196294758
1111
+ ],
1112
+ [
1113
+ "hd",
1114
+ -6.884016503575918
1115
+ ],
1116
+ [
1117
+ "ala",
1118
+ -6.885081628055053
1119
+ ],
1120
+ [
1121
+ "pn",
1122
+ -6.886470616505994
1123
+ ],
1124
+ [
1125
+ "fp",
1126
+ -6.888007121641019
1127
+ ],
1128
+ [
1129
+ "ty",
1130
+ -6.89272350275464
1131
+ ],
1132
+ [
1133
+ "cg",
1134
+ -6.893033028167876
1135
+ ],
1136
+ [
1137
+ "tq",
1138
+ -6.894648525208451
1139
+ ],
1140
+ [
1141
+ "ac",
1142
+ -6.922264253254053
1143
+ ],
1144
+ [
1145
+ "dh",
1146
+ -6.940052819724661
1147
+ ],
1148
+ [
1149
+ "th",
1150
+ -6.946611361274389
1151
+ ],
1152
+ [
1153
+ "nf",
1154
+ -6.94773061409305
1155
+ ],
1156
+ [
1157
+ "yt",
1158
+ -6.947895582263566
1159
+ ],
1160
+ [
1161
+ "ne",
1162
+ -6.950186435523683
1163
+ ],
1164
+ [
1165
+ "laa",
1166
+ -6.9618966428821984
1167
+ ],
1168
+ [
1169
+ "mk",
1170
+ -6.963876569469246
1171
+ ],
1172
+ [
1173
+ "fn",
1174
+ -6.9651216440287556
1175
+ ],
1176
+ [
1177
+ "mi",
1178
+ -6.973364756865921
1179
+ ],
1180
+ [
1181
+ "rw",
1182
+ -6.976192205590854
1183
+ ],
1184
+ [
1185
+ "sh",
1186
+ -6.976252466655415
1187
+ ],
1188
+ [
1189
+ "yf",
1190
+ -6.98466102259021
1191
+ ],
1192
+ [
1193
+ "ye",
1194
+ -6.985376860426236
1195
+ ],
1196
+ [
1197
+ "iy",
1198
+ -6.997339619581055
1199
+ ],
1200
+ [
1201
+ "ph",
1202
+ -7.002791342761229
1203
+ ],
1204
+ [
1205
+ "ca",
1206
+ -7.00714306958699
1207
+ ],
1208
+ [
1209
+ "nk",
1210
+ -7.01679175195992
1211
+ ],
1212
+ [
1213
+ "vc",
1214
+ -7.026717686171381
1215
+ ],
1216
+ [
1217
+ "dn",
1218
+ -7.0329619448909995
1219
+ ],
1220
+ [
1221
+ "lc",
1222
+ -7.04162123422458
1223
+ ],
1224
+ [
1225
+ "iq",
1226
+ -7.052050568293268
1227
+ ],
1228
+ [
1229
+ "qd",
1230
+ -7.056415484136933
1231
+ ],
1232
+ [
1233
+ "ht",
1234
+ -7.057894344010668
1235
+ ],
1236
+ [
1237
+ "cr",
1238
+ -7.062175308655888
1239
+ ],
1240
+ [
1241
+ "ey",
1242
+ -7.067369296531744
1243
+ ],
1244
+ [
1245
+ "py",
1246
+ -7.06762170686601
1247
+ ],
1248
+ [
1249
+ "gc",
1250
+ -7.068569810281758
1251
+ ],
1252
+ [
1253
+ "he",
1254
+ -7.068724837052839
1255
+ ],
1256
+ [
1257
+ "ws",
1258
+ -7.083834967007803
1259
+ ],
1260
+ [
1261
+ "qn",
1262
+ -7.094736754675786
1263
+ ],
1264
+ [
1265
+ "hs",
1266
+ -7.097107793695214
1267
+ ],
1268
+ [
1269
+ "gm",
1270
+ -7.102912164950451
1271
+ ],
1272
+ [
1273
+ "cl",
1274
+ -7.117371301671321
1275
+ ],
1276
+ [
1277
+ "hh",
1278
+ -7.119373071454689
1279
+ ],
1280
+ [
1281
+ "cs",
1282
+ -7.123202520839005
1283
+ ],
1284
+ [
1285
+ "nq",
1286
+ -7.123573885267396
1287
+ ],
1288
+ [
1289
+ "me",
1290
+ -7.140295705363691
1291
+ ],
1292
+ [
1293
+ "fy",
1294
+ -7.145687063516517
1295
+ ],
1296
+ [
1297
+ "ggg",
1298
+ -7.148129052637575
1299
+ ],
1300
+ [
1301
+ "aal",
1302
+ -7.149286650265301
1303
+ ],
1304
+ [
1305
+ "yp",
1306
+ -7.156139099185067
1307
+ ],
1308
+ [
1309
+ "qf",
1310
+ -7.157060249855512
1311
+ ],
1312
+ [
1313
+ "wa",
1314
+ -7.171499064266852
1315
+ ],
1316
+ [
1317
+ "ky",
1318
+ -7.1761559132310495
1319
+ ],
1320
+ [
1321
+ "ny",
1322
+ -7.181344251441002
1323
+ ],
1324
+ [
1325
+ "lw",
1326
+ -7.18428827548775
1327
+ ],
1328
+ [
1329
+ "sc",
1330
+ -7.188554190683416
1331
+ ],
1332
+ [
1333
+ "md",
1334
+ -7.189176149884691
1335
+ ],
1336
+ [
1337
+ "rc",
1338
+ -7.2023406563778956
1339
+ ],
1340
+ [
1341
+ "aag",
1342
+ -7.215624965192319
1343
+ ],
1344
+ [
1345
+ "lm",
1346
+ -7.21766907832032
1347
+ ],
1348
+ [
1349
+ "kf",
1350
+ -7.220208888998105
1351
+ ],
1352
+ [
1353
+ "sw",
1354
+ -7.235424094169154
1355
+ ],
1356
+ [
1357
+ "yy",
1358
+ -7.248760750503438
1359
+ ],
1360
+ [
1361
+ "rrr",
1362
+ -7.256636321517087
1363
+ ],
1364
+ [
1365
+ "fk",
1366
+ -7.26446436072386
1367
+ ],
1368
+ [
1369
+ "qh",
1370
+ -7.272573702000296
1371
+ ],
1372
+ [
1373
+ "fq",
1374
+ -7.277283326780614
1375
+ ],
1376
+ [
1377
+ "yn",
1378
+ -7.279879376992396
1379
+ ],
1380
+ [
1381
+ "yq",
1382
+ -7.282357112300778
1383
+ ],
1384
+ [
1385
+ "hf",
1386
+ -7.294922693842759
1387
+ ],
1388
+ [
1389
+ "aga",
1390
+ -7.3089624623105145
1391
+ ],
1392
+ [
1393
+ "mn",
1394
+ -7.309472360282593
1395
+ ],
1396
+ [
1397
+ "ih",
1398
+ -7.33053451667163
1399
+ ],
1400
+ [
1401
+ "ava",
1402
+ -7.335882965161225
1403
+ ],
1404
+ [
1405
+ "cv",
1406
+ -7.339016321207151
1407
+ ],
1408
+ [
1409
+ "hq",
1410
+ -7.343025081428985
1411
+ ],
1412
+ [
1413
+ "wt",
1414
+ -7.3434986168208845
1415
+ ],
1416
+ [
1417
+ "mq",
1418
+ -7.354400689372477
1419
+ ],
1420
+ [
1421
+ "wg",
1422
+ -7.36382580580741
1423
+ ],
1424
+ [
1425
+ "yi",
1426
+ -7.3644973873263915
1427
+ ],
1428
+ [
1429
+ "dw",
1430
+ -7.366350376841645
1431
+ ],
1432
+ [
1433
+ "em",
1434
+ -7.374811392160398
1435
+ ],
1436
+ [
1437
+ "lla",
1438
+ -7.392768462701696
1439
+ ],
1440
+ [
1441
+ "vaa",
1442
+ -7.41721359052082
1443
+ ],
1444
+ [
1445
+ "lll",
1446
+ -7.425657863607588
1447
+ ],
1448
+ [
1449
+ "wv",
1450
+ -7.428509738612526
1451
+ ],
1452
+ [
1453
+ "yk",
1454
+ -7.431757248014991
1455
+ ],
1456
+ [
1457
+ "vm",
1458
+ -7.433809580239302
1459
+ ],
1460
+ [
1461
+ "vw",
1462
+ -7.445818765018977
1463
+ ],
1464
+ [
1465
+ "aar",
1466
+ -7.4717401761921565
1467
+ ],
1468
+ [
1469
+ "hi",
1470
+ -7.486342911357944
1471
+ ],
1472
+ [
1473
+ "tw",
1474
+ -7.488981567298911
1475
+ ],
1476
+ [
1477
+ "cp",
1478
+ -7.492932113247878
1479
+ ],
1480
+ [
1481
+ "fh",
1482
+ -7.504062101936077
1483
+ ],
1484
+ [
1485
+ "wi",
1486
+ -7.50616719068525
1487
+ ],
1488
+ [
1489
+ "qy",
1490
+ -7.513321084614569
1491
+ ],
1492
+ [
1493
+ "wp",
1494
+ -7.518799984375065
1495
+ ],
1496
+ [
1497
+ "all",
1498
+ -7.53302752114244
1499
+ ],
1500
+ [
1501
+ "gag",
1502
+ -7.533965743298678
1503
+ ],
1504
+ [
1505
+ "ara",
1506
+ -7.537367311945358
1507
+ ],
1508
+ [
1509
+ "pw",
1510
+ -7.5446201708160405
1511
+ ],
1512
+ [
1513
+ "raa",
1514
+ -7.5669264012957616
1515
+ ],
1516
+ [
1517
+ "kh",
1518
+ -7.577026433974584
1519
+ ],
1520
+ [
1521
+ "wq",
1522
+ -7.581697650093153
1523
+ ],
1524
+ [
1525
+ "lar",
1526
+ -7.584606758968199
1527
+ ],
1528
+ [
1529
+ "lag",
1530
+ -7.590419078832234
1531
+ ],
1532
+ [
1533
+ "tc",
1534
+ -7.59811062974657
1535
+ ],
1536
+ [
1537
+ "ppp",
1538
+ -7.611586621905673
1539
+ ],
1540
+ [
1541
+ "mf",
1542
+ -7.616872823329912
1543
+ ],
1544
+ [
1545
+ "cd",
1546
+ -7.624203031434293
1547
+ ],
1548
+ [
1549
+ "lgl",
1550
+ -7.635044047159843
1551
+ ],
1552
+ [
1553
+ "rar",
1554
+ -7.648183656824516
1555
+ ],
1556
+ [
1557
+ "lal",
1558
+ -7.656940108094723
1559
+ ],
1560
+ [
1561
+ "arr",
1562
+ -7.661151098652345
1563
+ ],
1564
+ [
1565
+ "ic",
1566
+ -7.683928182319702
1567
+ ],
1568
+ [
1569
+ "wd",
1570
+ -7.688213319192755
1571
+ ],
1572
+ [
1573
+ "fc",
1574
+ -7.688873646632947
1575
+ ],
1576
+ [
1577
+ "hy",
1578
+ -7.701008250661593
1579
+ ],
1580
+ [
1581
+ "wn",
1582
+ -7.707791318520682
1583
+ ],
1584
+ [
1585
+ "ew",
1586
+ -7.708192951886026
1587
+ ],
1588
+ [
1589
+ "wk",
1590
+ -7.727612230894543
1591
+ ],
1592
+ [
1593
+ "vla",
1594
+ -7.729698852587845
1595
+ ],
1596
+ [
1597
+ "agg",
1598
+ -7.737282862442257
1599
+ ],
1600
+ [
1601
+ "rlr",
1602
+ -7.744727930333475
1603
+ ],
1604
+ [
1605
+ "lae",
1606
+ -7.766894216302047
1607
+ ],
1608
+ [
1609
+ "pap",
1610
+ -7.775081139986716
1611
+ ],
1612
+ [
1613
+ "u",
1614
+ -17.47096861423773
1615
+ ],
1616
+ [
1617
+ "b",
1618
+ -17.671612117272954
1619
+ ],
1620
+ [
1621
+ "z",
1622
+ -18.91242536750401
1623
+ ],
1624
+ [
1625
+ "o",
1626
+ -20.362425367503242
1627
+ ]
1628
+ ]
1629
+ }
1630
+ }
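
The `Unigram` model in `tokenizer.json` stores one log-probability per token, and a sequence is segmented by choosing the split with the highest total score. The following is a minimal sketch of that idea using Viterbi dynamic programming over a hand-copied subset of the vocabulary above. The function name `viterbi_segment` and the unknown-token penalty `UNK_SCORE` are assumptions made for this illustration; in practice the file would more likely be loaded with a tokenizer library than re-implemented like this.

```python
import math

# Hand-copied subset of the "vocab" entries in tokenizer.json (token -> log-probability).
VOCAB = {
    "a": -3.632944090450515,
    "l": -3.7068570202096325,
    "g": -3.9042950954442333,
    "aa": -5.022472558408127,
    "al": -5.180538576122947,
    "ag": -5.2607155787369955,
    "la": -5.287911107396141,
    "ala": -6.885081628055053,
    "laa": -6.9618966428821984,
    "aal": -7.149286650265301,
    "aag": -7.215624965192319,
    "lag": -7.590419078832234,
}
UNK = "<UNK>"
# Assumed penalty for characters outside the subset; not a value read from the file.
UNK_SCORE = min(VOCAB.values()) - 10.0


def viterbi_segment(sequence, vocab=VOCAB):
    """Return the highest-scoring segmentation of `sequence` and its total score."""
    text = sequence.lower()  # mirrors the "Lowercase" normalizer in the config
    n = len(text)
    max_len = max(len(token) for token in vocab)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any split of text[:i]
    back = [None] * (n + 1)       # back[i]: (start, token) of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                piece, score = UNK, UNK_SCORE  # single unknown character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, piece)
    # Walk the back-pointers from the end of the sequence to recover the tokens.
    tokens, pos = [], n
    while pos > 0:
        start, piece = back[pos]
        tokens.append(piece)
        pos = start
    return tokens[::-1], best[n]


tokens, score = viterbi_segment("ALAG")
print(tokens, score)
```

With this subset, splitting "ALAG" into two two-letter tokens scores higher than four single-letter tokens because the sum of fewer, moderately likely tokens beats the sum of many likely ones.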