colesimmons committed on
Commit
4c7fb63
1 Parent(s): 366d64f

Upload tokenizer

Files changed (4)
  1. README.md +199 -0
  2. special_tokens_map.json +130 -0
  3. tokenizer.json +789 -0
  4. tokenizer_config.json +115 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
special_tokens_map.json ADDED
@@ -0,0 +1,130 @@
+ {
+ "additional_special_tokens": [
+ {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<RULING>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<mask>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<BLANK_SPACE>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "\n",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<COLUMN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<SURFACE>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "...",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ ],
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "cls_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
@@ -0,0 +1,789 @@
+ {
+ "version": "1.0",
+ "truncation": null,
+ "padding": null,
+ "added_tokens": [
+ {
+ "id": 0,
+ "content": "<s>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 1,
+ "content": "<pad>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 2,
+ "content": "</s>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 3,
+ "content": "<unk>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 4,
+ "content": "<mask>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 5,
+ "content": "\n",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 6,
+ "content": "<SURFACE>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 7,
+ "content": "<COLUMN>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 8,
+ "content": "<BLANK_SPACE>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 9,
+ "content": "<RULING>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ },
+ {
+ "id": 10,
+ "content": "...",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": false,
+ "special": true
+ }
+ ],
+ "normalizer": null,
+ "pre_tokenizer": {
+ "type": "UnicodeScripts"
+ },
+ "post_processor": {
+ "type": "RobertaProcessing",
+ "sep": [
+ "</s>",
+ 2
+ ],
+ "cls": [
+ "<s>",
+ 0
+ ],
+ "trim_offsets": true,
+ "add_prefix_space": true
+ },
+ "decoder": {
+ "type": "BPEDecoder",
+ "suffix": "</w>"
+ },
+ "model": {
+ "type": "BPE",
+ "dropout": null,
+ "unk_token": null,
+ "continuing_subword_prefix": null,
+ "end_of_word_suffix": null,
+ "fuse_unk": false,
+ "byte_fallback": false,
+ "ignore_merges": false,
+ "vocab": {
+ "<s>": 0,
+ "<pad>": 1,
+ "</s>": 2,
+ "<unk>": 3,
+ "<mask>": 4,
+ "\n": 5,
+ "<SURFACE>": 6,
+ "<COLUMN>": 7,
+ "<BLANK_SPACE>": 8,
+ "<RULING>": 9,
+ "...": 10,
+ " ": 11,
+ "𒀀": 12,
+ "𒀄": 13,
+ "𒀇": 14,
+ "𒀉": 15,
+ "𒀊": 16,
+ "𒀋": 17,
+ "𒀍": 18,
+ "𒀏": 19,
+ "𒀔": 20,
+ "𒀕": 21,
+ "𒀖": 22,
+ "𒀘": 23,
+ "𒀚": 24,
+ "𒀛": 25,
+ "𒀜": 26,
+ "𒀝": 27,
+ "𒀞": 28,
+ "𒀠": 29,
+ "𒀧": 30,
+ "𒀩": 31,
+ "𒀪": 32,
+ "𒀫": 33,
+ "𒀬": 34,
+ "𒀭": 35,
+ "𒀮": 36,
+ "𒀯": 37,
+ "𒀲": 38,
+ "𒀳": 39,
+ "𒀴": 40,
+ "𒀵": 41,
+ "𒀶": 42,
+ "𒀸": 43,
+ "𒀹": 44,
+ "𒀼": 45,
+ "𒀾": 46,
+ "𒀿": 47,
+ "𒁀": 48,
+ "𒁁": 49,
+ "𒁃": 50,
+ "𒁄": 51,
+ "𒁆": 52,
+ "𒁇": 53,
+ "𒁈": 54,
+ "𒁉": 55,
+ "𒁋": 56,
+ "𒁌": 57,
+ "𒁍": 58,
+ "𒁎": 59,
+ "𒁑": 60,
+ "𒁒": 61,
+ "𒁓": 62,
+ "𒁔": 63,
+ "𒁕": 64,
+ "𒁖": 65,
+ "𒁚": 66,
+ "𒁛": 67,
+ "𒁞": 68,
+ "𒁟": 69,
+ "𒁡": 70,
+ "𒁣": 71,
+ "𒁥": 72,
+ "𒁦": 73,
+ "𒁪": 74,
+ "𒁬": 75,
+ "𒁭": 76,
+ "𒁮": 77,
+ "𒁯": 78,
+ "𒁰": 79,
+ "𒁱": 80,
+ "𒁲": 81,
+ "𒁳": 82,
+ "𒁴": 83,
+ "𒁵": 84,
+ "𒁶": 85,
+ "𒁷": 86,
+ "𒁹": 87,
+ "𒁺": 88,
+ "𒁻": 89,
+ "𒁼": 90,
+ "𒁽": 91,
+ "𒁾": 92,
+ "𒂀": 93,
+ "𒂁": 94,
+ "𒂂": 95,
+ "𒂃": 96,
+ "𒂄": 97,
+ "𒂅": 98,
+ "𒂆": 99,
+ "𒂇": 100,
+ "𒂈": 101,
+ "𒂉": 102,
+ "𒂊": 103,
+ "𒂍": 104,
+ "𒂔": 105,
+ "𒂕": 106,
+ "𒂖": 107,
+ "𒂗": 108,
+ "𒂙": 109,
+ "𒂚": 110,
+ "𒂞": 111,
+ "𒂟": 112,
+ "𒂠": 113,
+ "𒂡": 114,
+ "𒂢": 115,
+ "𒂣": 116,
+ "𒂤": 117,
+ "𒂥": 118,
+ "𒂦": 119,
+ "𒂨": 120,
+ "𒂫": 121,
+ "𒂬": 122,
+ "𒂮": 123,
+ "𒂯": 124,
+ "𒂰": 125,
+ "𒂵": 126,
+ "𒂶": 127,
+ "𒂷": 128,
+ "𒂸": 129,
+ "𒂼": 130,
+ "𒃅": 131,
+ "𒃉": 132,
+ "𒃋": 133,
+ "𒃌": 134,
+ "𒃍": 135,
+ "𒃎": 136,
+ "𒃕": 137,
+ "𒃘": 138,
+ "𒃙": 139,
+ "𒃞": 140,
+ "𒃟": 141,
+ "𒃠": 142,
+ "𒃡": 143,
+ "𒃢": 144,
+ "𒃣": 145,
+ "𒃥": 146,
+ "𒃩": 147,
+ "𒃫": 148,
+ "𒃮": 149,
+ "𒃰": 150,
+ "𒃱": 151,
+ "𒃲": 152,
+ "𒃳": 153,
+ "𒃴": 154,
+ "𒃵": 155,
+ "𒃶": 156,
+ "𒃷": 157,
+ "𒃸": 158,
+ "𒃹": 159,
+ "𒃺": 160,
+ "𒃻": 161,
+ "𒃼": 162,
+ "𒃽": 163,
+ "𒃾": 164,
+ "𒃿": 165,
+ "𒄀": 166,
+ "𒄃": 167,
+ "𒄄": 168,
+ "𒄆": 169,
+ "𒄇": 170,
+ "𒄈": 171,
+ "𒄉": 172,
+ "𒄊": 173,
+ "𒄋": 174,
+ "𒄌": 175,
+ "𒄎": 176,
+ "𒄐": 177,
+ "𒄑": 178,
+ "𒄒": 179,
+ "𒄕": 180,
+ "𒄖": 181,
+ "𒄗": 182,
+ "𒄘": 183,
+ "𒄙": 184,
+ "𒄛": 185,
+ "𒄝": 186,
+ "𒄞": 187,
+ "𒄟": 188,
+ "𒄠": 189,
+ "𒄢": 190,
+ "𒄣": 191,
+ "𒄤": 192,
+ "𒄥": 193,
+ "𒄦": 194,
+ "𒄧": 195,
+ "𒄨": 196,
+ "𒄩": 197,
+ "𒄪": 198,
+ "𒄫": 199,
+ "𒄬": 200,
+ "𒄭": 201,
+ "𒄮": 202,
+ "𒄯": 203,
+ "𒄰": 204,
+ "𒄱": 205,
+ "𒄲": 206,
+ "𒄴": 207,
+ "𒄵": 208,
+ "𒄷": 209,
+ "𒄸": 210,
+ "𒄽": 211,
+ "𒄾": 212,
+ "𒄿": 213,
+ "𒅀": 214,
+ "𒅁": 215,
+ "𒅂": 216,
+ "𒅃": 217,
+ "𒅅": 218,
+ "𒅆": 219,
+ "𒅇": 220,
+ "𒅈": 221,
+ "𒅊": 222,
+ "𒅋": 223,
+ "𒅌": 224,
+ "𒅍": 225,
+ "𒅎": 226,
+ "𒅏": 227,
+ "𒅓": 228,
+ "𒅔": 229,
+ "𒅕": 230,
+ "𒅖": 231,
+ "𒅗": 232,
+ "𒅘": 233,
+ "𒅜": 234,
+ "𒅝": 235,
+ "𒅡": 236,
+ "𒅢": 237,
+ "𒅤": 238,
+ "𒅥": 239,
+ "𒅮": 240,
+ "𒅲": 241,
+ "𒅴": 242,
+ "𒅺": 243,
+ "𒅻": 244,
+ "𒅾": 245,
+ "𒅿": 246,
+ "𒆀": 247,
+ "𒆁": 248,
+ "𒆂": 249,
+ "𒆃": 250,
+ "𒆈": 251,
+ "𒆉": 252,
+ "𒆍": 253,
+ "𒆏": 254,
+ "𒆐": 255,
+ "𒆑": 256,
+ "𒆒": 257,
+ "𒆓": 258,
+ "𒆕": 259,
+ "𒆗": 260,
+ "𒆘": 261,
+ "𒆚": 262,
+ "𒆛": 263,
+ "𒆜": 264,
+ "𒆝": 265,
+ "𒆟": 266,
+ "𒆠": 267,
+ "𒆢": 268,
+ "𒆤": 269,
+ "𒆥": 270,
+ "𒆦": 271,
+ "𒆧": 272,
+ "𒆨": 273,
+ "𒆪": 274,
+ "𒆬": 275,
+ "𒆭": 276,
+ "𒆯": 277,
+ "𒆰": 278,
+ "𒆲": 279,
+ "𒆳": 280,
+ "𒆵": 281,
+ "𒆶": 282,
+ "𒆷": 283,
+ "𒆸": 284,
+ "𒆹": 285,
+ "𒆾": 286,
+ "𒇀": 287,
+ "𒇁": 288,
+ "𒇅": 289,
+ "𒇆": 290,
+ "𒇇": 291,
+ "𒇈": 292,
+ "𒇉": 293,
+ "𒇋": 294,
+ "𒇌": 295,
+ "𒇑": 296,
+ "𒇒": 297,
+ "𒇙": 298,
+ "𒇚": 299,
+ "𒇟": 300,
+ "𒇡": 301,
+ "𒇥": 302,
+ "𒇦": 303,
+ "𒇧": 304,
+ "𒇬": 305,
+ "𒇭": 306,
+ "𒇯": 307,
+ "𒇰": 308,
+ "𒇱": 309,
+ "𒇲": 310,
+ "𒇳": 311,
+ "𒇴": 312,
+ "𒇵": 313,
+ "𒇶": 314,
+ "𒇷": 315,
+ "𒇸": 316,
+ "𒇹": 317,
+ "𒇺": 318,
+ "𒇻": 319,
+ "𒇼": 320,
+ "𒇽": 321,
+ "𒇿": 322,
+ "𒈀": 323,
+ "𒈁": 324,
+ "𒈂": 325,
+ "𒈌": 326,
+ "𒈐": 327,
+ "𒈕": 328,
+ "𒈖": 329,
+ "𒈗": 330,
+ "𒈛": 331,
+ "𒈜": 332,
+ "𒈝": 333,
+ "𒈠": 334,
+ "𒈢": 335,
+ "𒈣": 336,
+ "𒈤": 337,
+ "𒈥": 338,
+ "𒈦": 339,
+ "𒈧": 340,
+ "𒈨": 341,
+ "𒈩": 342,
+ "𒈪": 343,
+ "𒈫": 344,
+ "𒈬": 345,
+ "𒈭": 346,
+ "𒈮": 347,
+ "𒈯": 348,
+ "𒈰": 349,
+ "𒈱": 350,
+ "𒈲": 351,
+ "𒈸": 352,
+ "𒈹": 353,
+ "𒈻": 354,
+ "𒈽": 355,
+ "𒈾": 356,
+ "𒈿": 357,
+ "𒉀": 358,
+ "𒉁": 359,
+ "𒉄": 360,
+ "𒉅": 361,
+ "𒉆": 362,
+ "𒉇": 363,
+ "𒉈": 364,
+ "𒉋": 365,
+ "𒉌": 366,
+ "𒉎": 367,
+ "𒉏": 368,
+ "𒉐": 369,
+ "𒉒": 370,
+ "𒉓": 371,
+ "𒉖": 372,
+ "𒉘": 373,
+ "𒉚": 374,
+ "𒉠": 375,
+ "𒉡": 376,
+ "𒉢": 377,
+ "𒉣": 378,
+ "𒉥": 379,
+ "𒉦": 380,
+ "𒉩": 381,
+ "𒉪": 382,
+ "𒉭": 383,
+ "𒉮": 384,
+ "𒉯": 385,
+ "𒉴": 386,
+ "𒉵": 387,
+ "𒉶": 388,
+ "𒉺": 389,
+ "𒉻": 390,
+ "𒉼": 391,
+ "𒉽": 392,
+ "𒉾": 393,
+ "𒉿": 394,
+ "𒊊": 395,
+ "𒊌": 396,
+ "𒊍": 397,
+ "𒊎": 398,
+ "𒊏": 399,
+ "𒊐": 400,
+ "𒊑": 401,
+ "𒊒": 402,
+ "𒊓": 403,
+ "𒊔": 404,
+ "𒊕": 405,
+ "𒊗": 406,
+ "𒊙": 407,
+ "𒊚": 408,
+ "𒊠": 409,
+ "𒊢": 410,
+ "𒊨": 411,
+ "𒊩": 412,
+ "𒊫": 413,
+ "𒊬": 414,
+ "𒊭": 415,
+ "𒊮": 416,
+ "𒊯": 417,
+ "𒊲": 418,
+ "𒊴": 419,
+ "𒊷": 420,
+ "𒊹": 421,
+ "𒊺": 422,
+ "𒊻": 423,
+ "𒊽": 424,
+ "𒊾": 425,
+ "𒊿": 426,
+ "𒋀": 427,
+ "𒋁": 428,
+ "𒋃": 429,
+ "𒋄": 430,
+ "𒋆": 431,
+ "𒋇": 432,
+ "𒋋": 433,
+ "𒋍": 434,
+ "𒋎": 435,
+ "𒋒": 436,
+ "𒋓": 437,
+ "𒋖": 438,
+ "𒋗": 439,
+ "𒋙": 440,
+ "𒋚": 441,
+ "𒋛": 442,
+ "𒋜": 443,
+ "𒋝": 444,
+ "𒋞": 445,
+ "𒋠": 446,
+ "𒋡": 447,
+ "𒋢": 448,
+ "𒋤": 449,
+ "𒋥": 450,
+ "𒋦": 451,
+ "𒋧": 452,
+ "𒋨": 453,
+ "𒋩": 454,
+ "𒋪": 455,
+ "𒋫": 456,
+ "𒋭": 457,
+ "𒋰": 458,
+ "𒋳": 459,
+ "𒋷": 460,
+ "𒋸": 461,
+ "𒋺": 462,
+ "𒋻": 463,
+ "𒋼": 464,
+ "𒋽": 465,
+ "𒋾": 466,
+ "𒌀": 467,
+ "𒌁": 468,
+ "𒌃": 469,
+ "𒌄": 470,
+ "𒌅": 471,
+ "𒌆": 472,
+ "𒌇": 473,
+ "𒌈": 474,
+ "𒌉": 475,
+ "𒌋": 476,
+ "𒌌": 477,
+ "𒌍": 478,
+ "𒌏": 479,
+ "𒌑": 480,
+ "𒌒": 481,
+ "𒌓": 482,
+ "𒌔": 483,
+ "𒌕": 484,
+ "𒌗": 485,
+ "𒌙": 486,
+ "𒌚": 487,
+ "𒌛": 488,
+ "𒌜": 489,
+ "𒌝": 490,
+ "𒌢": 491,
+ "𒌣": 492,
+ "𒌤": 493,
+ "𒌦": 494,
+ "𒌧": 495,
+ "𒌨": 496,
+ "𒌪": 497,
+ "𒌫": 498,
+ "𒌬": 499,
+ "𒌰": 500,
+ "𒌱": 501,
+ "𒌲": 502,
+ "𒌴": 503,
+ "𒌵": 504,
+ "𒌶": 505,
+ "𒌷": 506,
+ "𒌸": 507,
+ "𒌺": 508,
+ "𒌼": 509,
+ "𒌾": 510,
+ "𒌿": 511,
+ "𒍀": 512,
+ "𒍂": 513,
+ "𒍇": 514,
+ "𒍋": 515,
+ "𒍍": 516,
+ "𒍎": 517,
+ "𒍏": 518,
+ "𒍑": 519,
+ "𒍒": 520,
+ "𒍔": 521,
+ "𒍕": 522,
+ "𒍖": 523,
+ "𒍚": 524,
+ "𒍜": 525,
+ "𒍝": 526,
+ "𒍞": 527,
+ "𒍠": 528,
+ "𒍢": 529,
+ "𒍣": 530,
+ "𒍤": 531,
+ "𒍥": 532,
+ "𒍦": 533,
+ "𒍨": 534,
+ "𒍩": 535,
+ "𒍪": 536,
+ "𒍫": 537,
+ "𒍬": 538,
+ "𒍮": 539,
+ "𒍺": 540,
+ "𒍼": 541,
+ "𒍽": 542,
+ "𒎉": 543,
+ "𒎌": 544,
+ "𒎎": 545,
+ "𒎏": 546,
+ "𒎐": 547,
+ "𒎒": 548,
+ "𒎓": 549,
+ "𒎗": 550,
+ "𒎙": 551,
+ "𒐀": 552,
+ "𒐁": 553,
+ "𒐂": 554,
+ "𒐃": 555,
+ "𒐄": 556,
+ "𒐅": 557,
+ "𒐆": 558,
+ "𒐇": 559,
+ "𒐈": 560,
+ "𒐉": 561,
+ "𒐊": 562,
+ "𒐋": 563,
+ "𒐌": 564,
+ "𒐍": 565,
+ "𒐏": 566,
+ "𒐐": 567,
+ "𒐑": 568,
+ "𒐒": 569,
+ "𒐓": 570,
+ "𒐔": 571,
+ "𒐕": 572,
+ "𒐖": 573,
+ "𒐗": 574,
+ "𒐘": 575,
+ "𒐙": 576,
+ "𒐚": 577,
+ "𒐛": 578,
+ "𒐜": 579,
+ "𒐝": 580,
+ "𒐞": 581,
+ "𒐟": 582,
+ "𒐠": 583,
+ "𒐡": 584,
+ "𒐢": 585,
+ "𒐣": 586,
+ "𒐤": 587,
+ "𒐦": 588,
+ "𒐧": 589,
+ "𒐨": 590,
+ "𒐩": 591,
+ "𒐪": 592,
+ "𒐫": 593,
+ "𒐬": 594,
+ "𒐭": 595,
+ "𒐮": 596,
+ "𒐰": 597,
+ "𒐱": 598,
+ "𒐴": 599,
+ "𒐵": 600,
+ "𒐶": 601,
+ "𒐸": 602,
+ "𒐹": 603,
+ "𒐼": 604,
+ "𒑄": 605,
+ "𒑆": 606,
+ "𒑍": 607,
+ "𒑏": 608,
+ "𒑐": 609,
+ "𒑑": 610,
+ "𒑒": 611,
+ "𒑔": 612,
+ "𒑖": 613,
+ "𒑗": 614,
+ "𒑘": 615,
+ "𒑙": 616,
+ "𒑚": 617,
+ "𒑛": 618,
+ "𒑜": 619,
+ "𒑟": 620,
+ "𒑠": 621,
+ "𒒬": 622,
+ "𒒾": 623,
+ "𒓈": 624,
+ "𒓺": 625,
+ "𒓻": 626,
+ "𒓼": 627,
+ "𒔲": 628,
+ "𒔸": 629,
+ "𒕁": 630,
+ "𒕂": 631,
+ "𒌨𒀭": 632,
+ "𒀭𒂗": 633,
+ "𒐊𒋡": 634,
+ "𒋗𒃸": 635,
+ "𒆠𒁀": 636,
+ "𒀭𒂗𒍪": 637,
+ "𒀭𒎏": 638,
+ "𒅆𒌨": 639
+ },
+ "merges": [
+ "𒌨 𒀭",
+ "𒀭 𒂗",
+ "𒐊 𒋡",
+ "𒋗 𒃸",
+ "𒆠 𒁀",
+ "𒀭𒂗 𒍪",
+ "𒀭 𒎏",
+ "𒅆 𒌨"
+ ]
+ }
+ }
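
Reviewer note: the `merges` array in `tokenizer.json` is ordered, and in a BPE model earlier entries take priority over later ones. The following is a minimal sketch of that rank-ordered merging, assuming the input has already been split into individual signs. `MERGES` is copied from the file above; `RANKS` and `apply_merges` are hypothetical names used for illustration only, and the `tokenizers` library's actual implementation may differ in detail.

```python
# Merge rules copied from the "merges" array in tokenizer.json, in file order.
# Earlier entries have higher priority (lower rank).
MERGES = [
    "𒌨 𒀭",
    "𒀭 𒂗",
    "𒐊 𒋡",
    "𒋗 𒃸",
    "𒆠 𒁀",
    "𒀭𒂗 𒍪",
    "𒀭 𒎏",
    "𒅆 𒌨",
]

# Map each (left, right) pair to its rank; rank 0 is applied first.
RANKS = {tuple(m.split(" ")): rank for rank, m in enumerate(MERGES)}


def apply_merges(signs):
    """Repeatedly merge the best-ranked adjacent pair until no rule applies."""
    tokens = list(signs)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))
        if best not in RANKS:
            break  # no adjacent pair has a merge rule
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Three signs collapse into one token via rules 2 and 6 (vocab id 637).
print(apply_merges("𒀭𒂗𒍪"))
# With a leading 𒌨, rule 1 wins first and consumes the 𒀭, so the later
# rules can no longer fire on this sequence.
print(apply_merges("𒌨𒀭𒂗𒍪"))
```

The second call illustrates why merge order matters: once the higher-priority pair `𒌨𒀭` (id 632) is formed, the remaining signs `𒂗` (id 108) and `𒍪` (id 536) stay as separate tokens.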
tokenizer_config.json ADDED
@@ -0,0 +1,115 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "<mask>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "\n",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<SURFACE>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<COLUMN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "8": {
+ "content": "<BLANK_SPACE>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "9": {
+ "content": "<RULING>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "10": {
+ "content": "...",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "additional_special_tokens": [
+ "</s>",
+ "<RULING>",
+ "<mask>",
+ "<BLANK_SPACE>",
+ "<unk>",
+ "\n",
+ "<s>",
+ "<COLUMN>",
+ "<SURFACE>",
+ "...",
+ "<pad>"
+ ],
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": "<mask>",
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "tokenizer_class": "PreTrainedTokenizerFast",
+ "unk_token": "<unk>"
+ }
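
Reviewer note: two details of this configuration may be worth spelling out. The `RobertaProcessing` post-processor in `tokenizer.json` wraps encoded sequences with the `cls` and `sep` tokens (IDs 0 and 2 here), and the very large `model_max_length` equals the float `1e30` converted to an integer, which `transformers` appears to use as a placeholder when no maximum length has been set. The sketch below only mimics that wrapping on plain ID lists; `wrap_single`, `wrap_pair`, and `NO_LIMIT` are hypothetical names, not library API.

```python
# Special-token IDs as listed in added_tokens_decoder above.
CLS_ID = 0  # "<s>"
SEP_ID = 2  # "</s>"


def wrap_single(ids):
    """RoBERTa-style single sequence: <s> A </s>."""
    return [CLS_ID] + list(ids) + [SEP_ID]


def wrap_pair(ids_a, ids_b):
    """RoBERTa-style sequence pair: <s> A </s></s> B </s>."""
    return [CLS_ID] + list(ids_a) + [SEP_ID, SEP_ID] + list(ids_b) + [SEP_ID]


# The model_max_length value in tokenizer_config.json is int(1e30):
# the float 1e30 is not exactly 10**30, hence the odd trailing digits.
NO_LIMIT = int(1e30)

print(wrap_single([637]))       # one merged token between <s> and </s>
print(wrap_pair([35], [108]))   # two single-sign segments
print(NO_LIMIT == 1000000000000000019884624838656)
```

Because `model_max_length` is effectively unbounded, callers that need truncation would have to pass an explicit length themselves.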