asprenger committed on
Commit c4df5b5
Parent: cf2112c

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
```json
{
    "word_embedding_dimension": 768,
    "pooling_mode_cls_token": true,
    "pooling_mode_mean_tokens": false,
    "pooling_mode_max_tokens": false,
    "pooling_mode_mean_sqrt_len_tokens": false,
    "pooling_mode_weightedmean_tokens": false,
    "pooling_mode_lasttoken": false,
    "include_prompt": true
}
```
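This configuration selects only the `[CLS]` token's embedding (`pooling_mode_cls_token: true`), rather than averaging over all tokens. As an illustration, here is a minimal sketch of what CLS pooling means, written against plain `transformers`; the input sentence is arbitrary, and the explicit normalization at the end mirrors the model's separate `Normalize` module:

```python
# Minimal sketch of CLS pooling, assuming plain transformers;
# the input sentence is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")

inputs = tokenizer(["An example sentence."], return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 768)

# CLS pooling: keep only the first ([CLS]) token's vector.
embedding = hidden[:, 0]  # (batch, 768)
# The model's final Normalize module then L2-normalizes it.
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)
```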
README.md ADDED
@@ -0,0 +1,407 @@
---
base_model: BAAI/bge-base-en-v1.5
datasets: []
language:
- en
library_name: sentence-transformers
license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:6300
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: Additionally, we sell the ALLDATA brand of automotive diagnostic,
    repair, collision, and shop management software through www.alldata.com.
  sentences:
  - What was the percentage change in total revenue from fiscal year 2022 to fiscal
    year 2023?
  - What type of software is sold through the website alldata.com?
  - What tax implications apply to the future repatriation of incremental undistributed
    earnings by a REIT from its foreign subsidiaries?
- source_sentence: As of October 31, 2023, the principal payments on debt were $9,585
    million in total, with $216 million due as short-term and $9,369 million as long-term
    obligations.
  sentences:
  - How is mall net operating income (NOI) useful in evaluating mall operating performance?
  - What are the principal amounts for short-term and long-term debts to be paid as
    of October 31, 2023?
  - What was the main reason for the increase in the company's valuation allowance
    during fiscal 2023?
- source_sentence: Revenues derived in the United States were $22,007 million, $18,749
    million and $17,363 million for the fiscal years ended May 31, 2023, 2022 and
    2021, respectively.
  sentences:
  - What were the revenues derived in the United States for the fiscal years ended
    May 31, 2021, 2022, and 2023?
  - When was the auditors' report on the financial statements dated?
  - What was Richard C. Smith's role in AutoZone before being named Senior Vice President
    – Human Resources?
- source_sentence: As of December 31, 2023, the accrued employee compensation and
    benefits amounted to $592 million.
  sentences:
  - What was the outcome of the litigation regarding the Vermont net neutrality statute?
  - What are included immediately following Part IV in the Annual Report on Form 10-K?
  - What was the accrued employee compensation and benefits as of December 31, 2023?
- source_sentence: Diluted earnings per share were $16.69 in fiscal 2022 compared
    to $15.53 in fiscal 2021.
  sentences:
  - By what amount did fiscal 2022 diluted earnings per share increase from fiscal
    2021?
  - What are the total noncancelable purchase commitments as of December 31, 2023,
    and how are they distributed over different time periods?
  - What is the purpose of internal control over financial reporting as implemented
    by a company?
---

# BGE base Financial Matryoshka

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("asprenger/bge-base-financial-matryoshka")
# Run inference
sentences = [
    'Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.',
    'By what amount did fiscal 2022 diluted earnings per share increase from fiscal 2021?',
    'What are the total noncancelable purchase commitments as of December 31, 2023, and how are they distributed over different time periods?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
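Because the model was trained with MatryoshkaLoss (see Training Details below), its embeddings can also be truncated to smaller sizes with only a modest quality loss. A sketch of that usage, assuming the `truncate_dim` argument available in recent sentence-transformers releases:

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first N embedding dimensions;
# 256 is one of the sizes this model was trained for.
model = SentenceTransformer("asprenger/bge-base-financial-matryoshka", truncate_dim=256)
embeddings = model.encode([
    "Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.",
    "By what amount did fiscal 2022 diluted earnings per share increase from fiscal 2021?",
])
print(embeddings.shape)  # (2, 256)
```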

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
161
+ ## Training Details
162
+
163
+ ### Training Dataset
164
+
165
+ #### Unnamed Dataset
166
+
167
+
168
+ * Size: 6,300 training samples
169
+ * Columns: <code>positive</code> and <code>anchor</code>
170
+ * Approximate statistics based on the first 1000 samples:
171
+ | | positive | anchor |
172
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
173
+ | type | string | string |
174
+ | details | <ul><li>min: 6 tokens</li><li>mean: 44.8 tokens</li><li>max: 260 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 20.57 tokens</li><li>max: 45 tokens</li></ul> |
175
+ * Samples:
176
+ | positive | anchor |
177
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------|
178
+ | <code>Item 8 includes Financial Statements and Supplementary Data.</code> | <code>What does Item 8 in a financial document typically contain?</code> |
179
+ | <code>Intellectual property rights are important to Nike's brand, success, and competitive position. The company strategically pursues protections of these rights and vigorously protects them against third-party theft and infringement.</code> | <code>What role does intellectual property play in Nike's competitive position?</code> |
180
+ | <code>If the Company determines that it is more likely than not that the fair value of the reporting unit is less than its carrying amount, then the Company is required to perform a quantitative assessment for impairment. Under the quantitative goodwill impairment test, if a reporting unit’s carrying amount exceeds its fair value, an impairment loss is recognized in an amount equal to the excess, not to exceed the total amount of goodwill allocated to that reporting unit.</code> | <code>What process does a company follow if it determines that the fair value of a reporting unit is less than its carrying amount?</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
  ```
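For reference, a sketch of how this loss configuration would be constructed with the sentence-transformers API (the base model name is taken from this card; the parameters mirror the JSON above):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Inner loss: in-batch negatives ranking over (anchor, positive) pairs.
inner_loss = losses.MultipleNegativesRankingLoss(model)
# Outer loss: applies the inner loss at several embedding sizes at once,
# which is what makes the resulting embeddings truncatable.
train_loss = losses.MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)
```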

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 16
- `gradient_accumulation_steps`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 4
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `bf16`: True
- `tf32`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

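As a sketch, these non-default values map onto the sentence-transformers 3.x training API roughly as follows (`output_dir` is illustrative, not taken from this card):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-financial-matryoshka",  # illustrative
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)
```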
#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 16
- `eval_accumulation_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: cosine
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: True
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>

### Training Logs
| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.8122 | 10   | 1.5567        |
| 1.6244 | 20   | 0.6836        |
| 2.4365 | 30   | 0.473         |
| 3.2487 | 40   | 0.3917        |

### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.1.2+cu121
- Accelerate: 0.32.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED
@@ -0,0 +1,32 @@
```json
{
    "_name_or_path": "BAAI/bge-base-en-v1.5",
    "architectures": ["BertModel"],
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": null,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {"0": "LABEL_0"},
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "label2id": {"LABEL_0": 0},
    "layer_norm_eps": 1e-12,
    "max_position_embeddings": 512,
    "model_type": "bert",
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 0,
    "position_embedding_type": "absolute",
    "torch_dtype": "float32",
    "transformers_version": "4.41.2",
    "type_vocab_size": 2,
    "use_cache": true,
    "vocab_size": 30522
}
```
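This is the unmodified BERT-base backbone configuration (12 layers, 12 attention heads, hidden size 768). A small sketch of inspecting it programmatically, assuming the standard `AutoConfig` API:

```python
from transformers import AutoConfig

# Loads config.json from the repo and exposes its fields as attributes.
config = AutoConfig.from_pretrained("asprenger/bge-base-financial-matryoshka")
print(config.model_type, config.num_hidden_layers, config.hidden_size)
# bert 12 768
```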
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
```json
{
    "__version__": {
        "sentence_transformers": "3.0.1",
        "transformers": "4.41.2",
        "pytorch": "2.1.2+cu121"
    },
    "prompts": {},
    "default_prompt_name": null,
    "similarity_fn_name": null
}
```
model.safetensors ADDED
@@ -0,0 +1,3 @@
```
version https://git-lfs.github.com/spec/v1
oid sha256:ec4b2b716220a597ca340d50aa35372588beb5b29e5229d84e36f7358125228b
size 437951328
```
modules.json ADDED
@@ -0,0 +1,20 @@
```json
[
    {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
    {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
    {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
```
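modules.json declares the three-stage pipeline (Transformer → Pooling → Normalize) that `SentenceTransformer` assembles on load. A sketch of building the equivalent pipeline by hand with the `sentence_transformers.models` API:

```python
from sentence_transformers import SentenceTransformer, models

# Stage 0: the BERT encoder (stored at the repo root, path "").
word_embedding = models.Transformer("BAAI/bge-base-en-v1.5", max_seq_length=512)
# Stage 1: CLS pooling, matching 1_Pooling/config.json.
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768
    pooling_mode="cls",
)
# Stage 2: L2 normalization, so cosine similarity reduces to a dot product.
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])
```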
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
```json
{
    "max_seq_length": 512,
    "do_lower_case": true
}
```
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
```json
{
    "cls_token": {"content": "[CLS]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
    "mask_token": {"content": "[MASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
    "pad_token": {"content": "[PAD]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
    "sep_token": {"content": "[SEP]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
    "unk_token": {"content": "[UNK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
```json
{
    "added_tokens_decoder": {
        "0":   {"content": "[PAD]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
        "100": {"content": "[UNK]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
        "101": {"content": "[CLS]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
        "102": {"content": "[SEP]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
        "103": {"content": "[MASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}
    },
    "clean_up_tokenization_spaces": true,
    "cls_token": "[CLS]",
    "do_basic_tokenize": true,
    "do_lower_case": true,
    "mask_token": "[MASK]",
    "model_max_length": 512,
    "never_split": null,
    "pad_token": "[PAD]",
    "sep_token": "[SEP]",
    "strip_accents": null,
    "tokenize_chinese_chars": true,
    "tokenizer_class": "BertTokenizer",
    "unk_token": "[UNK]"
}
```
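The tokenizer is a standard uncased WordPiece `BertTokenizer` with a 512-token limit. A small sketch of the preprocessing it applies; the exact subword split shown in the comment is what the standard uncased BERT vocabulary would typically produce:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asprenger/bge-base-financial-matryoshka")

encoded = tokenizer("An Example Sentence.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'an', 'example', 'sentence', '.', '[SEP]']
# do_lower_case=true lowercases the text, [CLS]/[SEP] wrap every sequence,
# and model_max_length=512 caps the input length.
```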
vocab.txt ADDED
The diff for this file is too large to render. See raw diff