MugheesAwan11 committed on
Commit
7cdecc1
1 Parent(s): fe75d99

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,756 @@
1
+ ---
2
+ base_model: BAAI/bge-base-en-v1.5
3
+ datasets: []
4
+ language:
5
+ - en
6
+ library_name: sentence-transformers
7
+ license: apache-2.0
8
+ metrics:
9
+ - cosine_accuracy@1
10
+ - cosine_accuracy@3
11
+ - cosine_accuracy@5
12
+ - cosine_accuracy@10
13
+ - cosine_precision@1
14
+ - cosine_precision@3
15
+ - cosine_precision@5
16
+ - cosine_precision@10
17
+ - cosine_recall@1
18
+ - cosine_recall@3
19
+ - cosine_recall@5
20
+ - cosine_recall@10
21
+ - cosine_ndcg@10
22
+ - cosine_mrr@10
23
+ - cosine_map@100
24
+ pipeline_tag: sentence-similarity
25
+ tags:
26
+ - sentence-transformers
27
+ - sentence-similarity
28
+ - feature-extraction
29
+ - generated_from_trainer
30
+ - dataset_size:1872
31
+ - loss:MatryoshkaLoss
32
+ - loss:MultipleNegativesRankingLoss
33
+ widget:
34
+ - source_sentence: The Secretary of Health and Human Services may issue
35
+ an Emergency Use Authorization (EUA) to authorize unapproved medical products,
36
+ or unapproved uses of approved medical products, to be manufactured, marketed,
37
+ and sold in the context of an actual or potential emergency designated by the
38
+ government.
39
+ sentences:
40
+ - What was the aggregate intrinsic value of exercised stock options as of December
41
+ 30, 2023?
42
+ - What are some of the regulations related to data breach impact analysis and response?
43
+ - What does the Emergency Use Authorization (EUA) by the U.S. Secretary of Health
44
+ and Human Services allow?
45
+ - source_sentence: 'the Virginia Consumer Data Protection Act protect consumers? The
46
+ Virginia Consumer Data Protection Act protects consumers by prohibiting deceptive
47
+ and unfair trade practices, giving consumers the right to sue for damages, and
48
+ providing a mechanism for enforcement against businesses engaging in such practices.
49
+ ## Join Our Newsletter Get all the latest information, law updates and more delivered
50
+ to your inbox ### Share Copy 54 ### More Stories that May Interest You View More
51
+ September 21, 2023 ## Navigating Generative AI Privacy Challenges & Safeguarding
52
+ Tips Introduction The emergence of Generative AI has ushered in a new era of innovation
53
+ in the ever-evolving technological landscape that pushes the boundaries of...
54
+ View More September 13, 2023 ## Kuwait''s DPPR Kuwait didn’t have any data protection
55
+ law until the Communication and Information Technology Regulatory Authority (CITRA)
56
+ introduced the Data Privacy Protection Regulation'
57
+ sentences:
58
+ - What is Securiti's mission and history regarding Italy's GDPR implementation and
59
+ compliance?
60
+ - Which states have enacted data privacy laws like the VCDPA?
61
+ - How does the Virginia Consumer Data Protection Act protect consumers and how is
62
+ this protection enforced?
63
+ - source_sentence: Data Flow Intelligence & Governance Prevent sensitive data sprawl
64
+ through real-time streaming platforms Learn more Data Consent Automation First
65
+ Party Consent | Third Party & Cookie Consent Learn more Data Security Posture
66
+ Management Secure sensitive data in hybrid multicloud and SaaS environments Learn
67
+ more Data Breach Impact Analysis & Response Analyze impact of a data breach and
68
+ coordinate response per global regulatory obligations Learn more Data Catalog
69
+ Automatically catalog datasets and enable users to find, understand, trust and
70
+ access data Learn more Data Lineage Track changes and transformations of data
71
+ throughout its lifecycle Data Controls Orchestrator View Data Command Center View
72
+ Sensitive Data Intelligence View Asset Discovery Data Discovery & Classification
73
+ Sensitive Data Catalog People Data Graph Learn more Privacy , Sensitive Data
74
+ Intelligence Discover & Classify Structured and Unstructured Data | People Data
75
+ Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl
76
+ through real-time streaming platforms Learn more Data Consent Automation First
77
+ Party Consent | Third Party & Cookie Consent Learn more Data Security Posture
78
+ Management Secure sensitive data in hybrid multicloud and SaaS environments Learn
79
+ more Data Breach Impact Analysis & Response Analyze impact of a data breach and
80
+ coordinate response per global regulatory obligations Learn more Data Catalog
81
+ Automatically catalog datasets and enable users to find, understand, trust and
82
+ access data Learn more Data Lineage Track changes and transformations of data
83
+ throughout its lifecycle Data Controls Orchestrator View Data Command Center View
84
+ Sensitive Data Intelligence View
85
+ sentences:
86
+ - Why is it important to manage security of sensitive data in hybrid multicloud
87
+ and SaaS environments, prevent data sprawl, and analyze the impact of data breaches?
88
+ - What right does the consumer have regarding their personal data in terms of deletion?
89
+ - What is the legal basis for the LGPD in Brazil?
90
+ - source_sentence: its lifecycle Data Controls Orchestrator View Data Command Center
91
+ View Sensitive Data Intelligence View Asset Discovery Data Discovery & Classification
92
+ Sensitive Data Catalog People Data Graph Learn more Privacy Automate compliance
93
+ with global privacy regulations Data Mapping Automation View Data Subject Request
94
+ Automation View People Data Graph View Assessment Automation View Cookie Consent
95
+ View Universal Consent View Vendor Risk Assessment View Breach Management View
96
+ Privacy Policy Management View Privacy Center View Learn more Security Identify
97
+ data risk and enable protection & control Data Security Posture Management View
98
+ Data Access Intelligence & Governance View Data Risk Management View
99
+ sentences:
100
+ - What is ANPD's primary goal regarding LGPD and its rights and regulations?
101
+ - What options are there for joining the Securiti team and expanding knowledge in
102
+ data privacy, security, and governance?
103
+ - How does the Data Controls Orchestrator help automate compliance with global privacy
104
+ regulations?
105
+ - source_sentence: 'remediate the incident, promptly notify relevant individuals,
106
+ and report such data security incidents to the regulatory department(s). Thus,
107
+ you should have a robust security breach response mechanism in place. ## 7\. Cross
108
+ border data transfer and data localization requirements: Under DSL, Critical Information
109
+ Infrastructure Operators are required to store the important data in the territory
110
+ of China and cross-border transfer is regulated by the CSL. CIIOs need to conduct
111
+ a security assessment in accordance with the measures jointly defined by CAC and
112
+ the relevant departments under the State Council for the cross-border transfer
113
+ of important data for business necessity. For non Critical Information Infrastructure
114
+ operators, the important data cross-border transfer will be regulated by the measures
115
+ announced by the Cyberspace Administration of China (CAC) and other authorities.
116
+ However, those “measures” have still not yet been released. DSL also intends to
117
+ establish a data national security review and export control system to restrict
118
+ the cross-border transmission of data'
119
+ sentences:
120
+ - What are the requirements for storing important data in the territory of China
121
+ under DSL?
122
+ - How does behavioral targeting relate to the processing of personal data under
123
+ Bahrain PDPL?
124
+ - What is the margin of error generally estimated for worldwide Monthly Active People
125
+ (MAP)?
126
+ model-index:
127
+ - name: SentenceTransformer based on BAAI/bge-base-en-v1.5
128
+ results:
129
+ - task:
130
+ type: information-retrieval
131
+ name: Information Retrieval
132
+ dataset:
133
+ name: dim 768
134
+ type: dim_768
135
+ metrics:
136
+ - type: cosine_accuracy@1
137
+ value: 0.27835051546391754
138
+ name: Cosine Accuracy@1
139
+ - type: cosine_accuracy@3
140
+ value: 0.5463917525773195
141
+ name: Cosine Accuracy@3
142
+ - type: cosine_accuracy@5
143
+ value: 0.6494845360824743
144
+ name: Cosine Accuracy@5
145
+ - type: cosine_accuracy@10
146
+ value: 0.7835051546391752
147
+ name: Cosine Accuracy@10
148
+ - type: cosine_precision@1
149
+ value: 0.27835051546391754
150
+ name: Cosine Precision@1
151
+ - type: cosine_precision@3
152
+ value: 0.18213058419243983
153
+ name: Cosine Precision@3
154
+ - type: cosine_precision@5
155
+ value: 0.12989690721649483
156
+ name: Cosine Precision@5
157
+ - type: cosine_precision@10
158
+ value: 0.07835051546391751
159
+ name: Cosine Precision@10
160
+ - type: cosine_recall@1
161
+ value: 0.27835051546391754
162
+ name: Cosine Recall@1
163
+ - type: cosine_recall@3
164
+ value: 0.5463917525773195
165
+ name: Cosine Recall@3
166
+ - type: cosine_recall@5
167
+ value: 0.6494845360824743
168
+ name: Cosine Recall@5
169
+ - type: cosine_recall@10
170
+ value: 0.7835051546391752
171
+ name: Cosine Recall@10
172
+ - type: cosine_ndcg@10
173
+ value: 0.5204365648204007
174
+ name: Cosine Ndcg@10
175
+ - type: cosine_mrr@10
176
+ value: 0.4373834069710358
177
+ name: Cosine Mrr@10
178
+ - type: cosine_map@100
179
+ value: 0.44377152224424676
180
+ name: Cosine Map@100
181
+ - task:
182
+ type: information-retrieval
183
+ name: Information Retrieval
184
+ dataset:
185
+ name: dim 512
186
+ type: dim_512
187
+ metrics:
188
+ - type: cosine_accuracy@1
189
+ value: 0.28865979381443296
190
+ name: Cosine Accuracy@1
191
+ - type: cosine_accuracy@3
192
+ value: 0.5463917525773195
193
+ name: Cosine Accuracy@3
194
+ - type: cosine_accuracy@5
195
+ value: 0.6597938144329897
196
+ name: Cosine Accuracy@5
197
+ - type: cosine_accuracy@10
198
+ value: 0.7731958762886598
199
+ name: Cosine Accuracy@10
200
+ - type: cosine_precision@1
201
+ value: 0.28865979381443296
202
+ name: Cosine Precision@1
203
+ - type: cosine_precision@3
204
+ value: 0.18213058419243983
205
+ name: Cosine Precision@3
206
+ - type: cosine_precision@5
207
+ value: 0.1319587628865979
208
+ name: Cosine Precision@5
209
+ - type: cosine_precision@10
210
+ value: 0.07731958762886597
211
+ name: Cosine Precision@10
212
+ - type: cosine_recall@1
213
+ value: 0.28865979381443296
214
+ name: Cosine Recall@1
215
+ - type: cosine_recall@3
216
+ value: 0.5463917525773195
217
+ name: Cosine Recall@3
218
+ - type: cosine_recall@5
219
+ value: 0.6597938144329897
220
+ name: Cosine Recall@5
221
+ - type: cosine_recall@10
222
+ value: 0.7731958762886598
223
+ name: Cosine Recall@10
224
+ - type: cosine_ndcg@10
225
+ value: 0.5234913842554121
226
+ name: Cosine Ndcg@10
227
+ - type: cosine_mrr@10
228
+ value: 0.4444403534609721
229
+ name: Cosine Mrr@10
230
+ - type: cosine_map@100
231
+ value: 0.45150068207403454
232
+ name: Cosine Map@100
233
+ - task:
234
+ type: information-retrieval
235
+ name: Information Retrieval
236
+ dataset:
237
+ name: dim 256
238
+ type: dim_256
239
+ metrics:
240
+ - type: cosine_accuracy@1
241
+ value: 0.26804123711340205
242
+ name: Cosine Accuracy@1
243
+ - type: cosine_accuracy@3
244
+ value: 0.4845360824742268
245
+ name: Cosine Accuracy@3
246
+ - type: cosine_accuracy@5
247
+ value: 0.6494845360824743
248
+ name: Cosine Accuracy@5
249
+ - type: cosine_accuracy@10
250
+ value: 0.7628865979381443
251
+ name: Cosine Accuracy@10
252
+ - type: cosine_precision@1
253
+ value: 0.26804123711340205
254
+ name: Cosine Precision@1
255
+ - type: cosine_precision@3
256
+ value: 0.16151202749140892
257
+ name: Cosine Precision@3
258
+ - type: cosine_precision@5
259
+ value: 0.12989690721649483
260
+ name: Cosine Precision@5
261
+ - type: cosine_precision@10
262
+ value: 0.07628865979381441
263
+ name: Cosine Precision@10
264
+ - type: cosine_recall@1
265
+ value: 0.26804123711340205
266
+ name: Cosine Recall@1
267
+ - type: cosine_recall@3
268
+ value: 0.4845360824742268
269
+ name: Cosine Recall@3
270
+ - type: cosine_recall@5
271
+ value: 0.6494845360824743
272
+ name: Cosine Recall@5
273
+ - type: cosine_recall@10
274
+ value: 0.7628865979381443
275
+ name: Cosine Recall@10
276
+ - type: cosine_ndcg@10
277
+ value: 0.4964329019488686
278
+ name: Cosine Ndcg@10
279
+ - type: cosine_mrr@10
280
+ value: 0.4132302405498282
281
+ name: Cosine Mrr@10
282
+ - type: cosine_map@100
283
+ value: 0.41983416368750226
284
+ name: Cosine Map@100
285
+ ---
286
+
287
+ # SentenceTransformer based on BAAI/bge-base-en-v1.5
288
+
289
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
290
+
291
+ ## Model Details
292
+
293
+ ### Model Description
294
+ - **Model Type:** Sentence Transformer
295
+ - **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
296
+ - **Maximum Sequence Length:** 512 tokens
297
+ - **Output Dimensionality:** 768 dimensions
298
+ - **Similarity Function:** Cosine Similarity
299
+ <!-- - **Training Dataset:** Unknown -->
300
+ - **Language:** en
301
+ - **License:** apache-2.0
302
+
303
+ ### Model Sources
304
+
305
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
306
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
307
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
308
+
309
+ ### Full Model Architecture
310
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
318
+
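The module list above amounts to two steps after the transformer: keep the hidden state of the first (`[CLS]`) token, then scale it to unit length. The following is a minimal pure-Python sketch of those two steps, using random numbers in place of real BERT outputs. The function name and the toy shapes are illustrative assumptions, not part of the library.

```python
import math
import random


def cls_pool_and_normalize(token_embeddings):
    """Mimic CLS pooling followed by L2 normalization.

    token_embeddings: list of sequences, each a list of per-token vectors
    (batch x seq_len x hidden).
    """
    pooled = []
    for sequence in token_embeddings:
        cls_vector = sequence[0]  # first token stands in for [CLS]
        norm = math.sqrt(sum(x * x for x in cls_vector)) or 1.0
        pooled.append([x / norm for x in cls_vector])
    return pooled


random.seed(0)
# Hypothetical stand-in for transformer outputs: 3 texts, 16 tokens, 768 dims.
fake_hidden_states = [
    [[random.gauss(0, 1) for _ in range(768)] for _ in range(16)]
    for _ in range(3)
]
sentence_embeddings = cls_pool_and_normalize(fake_hidden_states)
print(len(sentence_embeddings), len(sentence_embeddings[0]))
```

Because every output vector has unit length, the dot product of two embeddings equals their cosine similarity.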
319
+ ## Usage
320
+
321
+ ### Direct Usage (Sentence Transformers)
322
+
323
+ First install the Sentence Transformers library:
324
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
328
+
329
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v18")
+ # Run inference
+ sentences = [
+     'remediate the incident, promptly notify relevant individuals, and report such data security incidents to the regulatory department(s). Thus, you should have a robust security breach response mechanism in place. ## 7\\. Cross border data transfer and data localization requirements: Under DSL, Critical Information Infrastructure Operators are required to store the important data in the territory of China and cross-border transfer is regulated by the CSL. CIIOs need to conduct a security assessment in accordance with the measures jointly defined by CAC and the relevant departments under the State Council for the cross-border transfer of important data for business necessity. For non Critical Information Infrastructure operators, the important data cross-border transfer will be regulated by the measures announced by the Cyberspace Administration of China (CAC) and other authorities. However, those “measures” have still not yet been released. DSL also intends to establish a data national security review and export control system to restrict the cross-border transmission of data',
+     'What are the requirements for storing important data in the territory of China under DSL?',
+     'What is the margin of error generally estimated for worldwide Monthly Active People (MAP)?',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
350
+
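Since the card lists `MatryoshkaLoss` with dimensions 768, 512 and 256, shorter embeddings can in principle be obtained by keeping the leading components of a full vector and renormalizing. The sketch below illustrates that idea on random vectors rather than real model output. The helper names are hypothetical; in practice the input vectors would come from `model.encode`.

```python
import math
import random


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
# Random stand-ins for two 768-dimensional sentence embeddings.
full_a = [random.gauss(0, 1) for _ in range(768)]
full_b = [random.gauss(0, 1) for _ in range(768)]

for dim in (768, 512, 256):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    print(dim, len(a), round(cosine_similarity(a, b), 4))
```

Recent releases of the library also accept a `truncate_dim` argument when constructing `SentenceTransformer`; check the documentation of the installed version before relying on it.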
351
+ <!--
352
+ ### Direct Usage (Transformers)
353
+
354
+ <details><summary>Click to see the direct usage in Transformers</summary>
355
+
356
+ </details>
357
+ -->
358
+
359
+ <!--
360
+ ### Downstream Usage (Sentence Transformers)
361
+
362
+ You can finetune this model on your own dataset.
363
+
364
+ <details><summary>Click to expand</summary>
365
+
366
+ </details>
367
+ -->
368
+
369
+ <!--
370
+ ### Out-of-Scope Use
371
+
372
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
373
+ -->
374
+
375
+ ## Evaluation
376
+
377
+ ### Metrics
378
+
379
+ #### Information Retrieval
380
+ * Dataset: `dim_768`
381
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
382
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.2784     |
+ | cosine_accuracy@3   | 0.5464     |
+ | cosine_accuracy@5   | 0.6495     |
+ | cosine_accuracy@10  | 0.7835     |
+ | cosine_precision@1  | 0.2784     |
+ | cosine_precision@3  | 0.1821     |
+ | cosine_precision@5  | 0.1299     |
+ | cosine_precision@10 | 0.0784     |
+ | cosine_recall@1     | 0.2784     |
+ | cosine_recall@3     | 0.5464     |
+ | cosine_recall@5     | 0.6495     |
+ | cosine_recall@10    | 0.7835     |
+ | cosine_ndcg@10      | 0.5204     |
+ | cosine_mrr@10       | 0.4374     |
+ | **cosine_map@100**  | **0.4438** |
400
+
401
+ #### Information Retrieval
402
+ * Dataset: `dim_512`
403
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
404
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.2887     |
+ | cosine_accuracy@3   | 0.5464     |
+ | cosine_accuracy@5   | 0.6598     |
+ | cosine_accuracy@10  | 0.7732     |
+ | cosine_precision@1  | 0.2887     |
+ | cosine_precision@3  | 0.1821     |
+ | cosine_precision@5  | 0.132      |
+ | cosine_precision@10 | 0.0773     |
+ | cosine_recall@1     | 0.2887     |
+ | cosine_recall@3     | 0.5464     |
+ | cosine_recall@5     | 0.6598     |
+ | cosine_recall@10    | 0.7732     |
+ | cosine_ndcg@10      | 0.5235     |
+ | cosine_mrr@10       | 0.4444     |
+ | **cosine_map@100**  | **0.4515** |
422
+
423
+ #### Information Retrieval
424
+ * Dataset: `dim_256`
425
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
426
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.268      |
+ | cosine_accuracy@3   | 0.4845     |
+ | cosine_accuracy@5   | 0.6495     |
+ | cosine_accuracy@10  | 0.7629     |
+ | cosine_precision@1  | 0.268      |
+ | cosine_precision@3  | 0.1615     |
+ | cosine_precision@5  | 0.1299     |
+ | cosine_precision@10 | 0.0763     |
+ | cosine_recall@1     | 0.268      |
+ | cosine_recall@3     | 0.4845     |
+ | cosine_recall@5     | 0.6495     |
+ | cosine_recall@10    | 0.7629     |
+ | cosine_ndcg@10      | 0.4964     |
+ | cosine_mrr@10       | 0.4132     |
+ | **cosine_map@100**  | **0.4198** |
444
+
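In all three tables, recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k (for example, 0.5464 / 3 ≈ 0.1821). That pattern is what one would expect if each evaluation query had exactly one relevant passage; this is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched from the rank of the relevant passage per query. The function name and the toy ranks below are made up for illustration and are not the real evaluation data.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy/precision/recall/MRR at cutoff k.

    ranks: 1-based rank of the single relevant passage for each query,
    or None when it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,          # share of queries with a hit in the top k
        "precision": hits / (k * n),   # one relevant item out of k retrieved
        "recall": hits / n,            # same as accuracy with one relevant item
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
    }


# Toy example: five queries, one of which never retrieves its passage.
toy_ranks = [1, 3, 2, None, 7]
for k in (1, 3, 5, 10):
    print(k, retrieval_metrics(toy_ranks, k))
```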
445
+ <!--
446
+ ## Bias, Risks and Limitations
447
+
448
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
449
+ -->
450
+
451
+ <!--
452
+ ### Recommendations
453
+
454
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
455
+ -->
456
+
457
+ ## Training Details
458
+
459
+ ### Training Dataset
460
+
461
+ #### Unnamed Dataset
462
+
463
+
464
+ * Size: 1,872 training samples
465
+ * Columns: <code>positive</code> and <code>anchor</code>
466
+ * Approximate statistics based on the first 1000 samples:
467
+ | | positive | anchor |
468
+ |:--------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
469
+ | type | string | string |
470
+ | details | <ul><li>min: 4 tokens</li><li>mean: 207.32 tokens</li><li>max: 414 tokens</li></ul> | <ul><li>min: 2 tokens</li><li>mean: 21.79 tokens</li><li>max: 102 tokens</li></ul> |
471
+ * Samples:
472
+ | positive | anchor |
473
+ |:---------|:-------|
474
+ | <code>Automation PrivacyCenter.Cloud | Data Mapping | DSR Automation | Assessment Automation | Vendor Assessment | Breach Management | Privacy Notice Learn more Sensitive Data Intelligence Discover & Classify Structured and Unstructured Data | People Data Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl through real-time streaming platforms Learn more Data Consent Automation First Party Consent | Third Party & Cookie Consent Learn more Data Security Posture Management Secure sensitive data in hybrid multicloud and SaaS environments Learn more Data Breach Impact Analysis & Response Analyze impact of a data breach and coordinate response per global regulatory obligations Learn more Data Catalog Automatically catalog datasets and enable users to find, understand, trust and access data Learn more Data Lineage Track changes and transformations of, PrivacyCenter.Cloud | Data Mapping | DSR Automation | Assessment Automation | Vendor Assessment | Breach Management | Privacy Notice Learn more Sensitive Data Intelligence Discover & Classify Structured and Unstructured Data | People Data Graph Learn more Data Flow Intelligence & Governance Prevent sensitive data sprawl through real-time streaming platforms Learn more Data Consent Automation First Party Consent | Third Party & Cookie Consent Learn more Data Security Posture Management Secure sensitive data in hybrid multicloud and SaaS environments Learn more Data Breach Impact Analysis & Response Analyze impact of a data breach and coordinate response per global regulatory obligations Learn more Data Catalog Automatically catalog datasets and enable users to find, understand, trust and access data Learn more Data Lineage Track changes and transformations of data throughout its</code> | <code>What is the purpose of Third Party & Cookie Consent in data automation and security?</code> |
475
+ | <code>the Tietosuojalaki. ### Greece #### Greece **Effective Date** : August 28, 2019 **Region** : EMEA (Europe, Middle East, Africa) Greek Law 4624/2019 was enacted to implement the GDPR and Directive (EU) 2016/680. The Hellenic Data Protection Agency (Αρχή προστασίας δεδομένων προσωπικού χαρακτήρα) is primarily responsible for overseeing the enforcement and implementation of Law 4624/2019 as well as the ePrivacy Directive within Greece. ### Iceland #### Iceland **Effective Date** : July 15, 2018 **Region** : EMEA (Europe, Middle East, Africa) ​​Act 90/2018 on Data Protection and Processing</code> | <code>What is the role of the Hellenic Data Protection Agency in overseeing the enforcement and implementation of Greek Law 4624/2019 and the ePrivacy Directive in Greece?</code> |
476
+ | <code>EU. GDPR also applies to organizations located outside the EU (those that do not have an establishment in the EU) if they offer goods or services to, or monitor the behavior of, data subjects located in the EU, irrespective of their nationality or the company’s location. ## Data Subject Rights PDPL provides individuals rights relating to their personal data, which they can exercise. Under PDPL, the data controller should ensure the identity verification of the data subject before processing his/her data subject request. Also, the data controller must not charge for data subjects for making the data subject requests. The data subject may file a complaint to the Authority against the data controller, where the data subject does not accept the data controller’s decision regarding the request, or if the prescribed period has expired without the data subject’s receipt of any notice regarding his request. GDPR also ensures data subject rights where the data subjects can request the controller or, whatever their nationality or place of residence, concerning the processing of their personal data.” Regarding extraterritorial scope, GDPR applies to organizations that are not established in the EU, but instead monitor individuals’ behavior, as long as their behavior occurs in the EU. GDPR also applies to organizations located outside the EU (those that do not have an establishment in the EU) if they offer goods or services to, or monitor the behavior of, data subjects located in the EU, irrespective of their nationality or the company’s location. ## Rights Both regulations give individuals rights relating to their personal data, which they can exercise. Under LPPD, the data controller must process data subject’ requests and take all necessary administrative and technical measures within 30 days. LPPD does not provide a period extension. There is no fee for the data subject’ request to data controllers. However, the data controller may impose a fee, as set by the</code> | <code>What are the data subjects' rights under GDPR regarding behavior monitoring, and how do they compare to the rights under PDPL?</code> |
477
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+ ```json
+ {
+     "loss": "MultipleNegativesRankingLoss",
+     "matryoshka_dims": [
+         768,
+         512,
+         256
+     ],
+     "matryoshka_weights": [
+         1,
+         1,
+         1
+     ],
+     "n_dims_per_step": -1
+ }
+ ```
494
+
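The parameters above describe an in-batch ranking loss applied at several embedding sizes and summed with equal weights. The following pure-Python sketch shows the idea on random vectors. It is not the library's implementation: the function names are hypothetical, and the similarity scale of 20.0 is an assumed value.

```python
import math
import random


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims=(768, 512, 256), weights=(1, 1, 1)):
    """Weighted sum of the ranking loss on embeddings truncated to each dim."""
    return sum(
        w * mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )


random.seed(0)
# Toy batch: each positive is a slightly perturbed copy of its anchor.
anchors = [[random.gauss(0, 1) for _ in range(768)] for _ in range(4)]
positives = [[x + 0.1 * random.gauss(0, 1) for x in a] for a in anchors]
mismatched = positives[1:] + positives[:1]  # pairs deliberately misaligned

print(matryoshka_loss(anchors, positives))   # small: pairs line up
print(matryoshka_loss(anchors, mismatched))  # large: pairs do not line up
```

Summing the loss over truncated prefixes is what pushes the leading dimensions to carry most of the signal, which is why the 512- and 256-dimensional results can be evaluated separately.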
495
+ ### Training Hyperparameters
496
+ #### Non-Default Hyperparameters
497
+
498
+ - `eval_strategy`: epoch
499
+ - `per_device_train_batch_size`: 32
500
+ - `per_device_eval_batch_size`: 16
501
+ - `learning_rate`: 2e-05
502
+ - `num_train_epochs`: 4
503
+ - `lr_scheduler_type`: cosine
504
+ - `warmup_ratio`: 0.1
505
+ - `bf16`: True
506
+ - `tf32`: True
507
+ - `load_best_model_at_end`: True
508
+ - `optim`: adamw_torch_fused
509
+ - `batch_sampler`: no_duplicates
510
+
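The `learning_rate`, `warmup_ratio`, `lr_scheduler_type` and `num_train_epochs` settings above together define a warmup-then-cosine-decay schedule. The sketch below approximates it in plain Python. The step count assumes a single device and that the final partial batch is kept; the `no_duplicates` sampler may yield a different number of batches in practice, so treat the numbers as illustrative. The function name is hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 1,872 samples at batch size 32 for 4 epochs (assumed single device).
steps_per_epoch = math.ceil(1872 / 32)
total_steps = steps_per_epoch * 4

for step in (0, 12, 24, 130, total_steps):
    print(step, lr_at_step(step, total_steps))
```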
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: epoch
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 4
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: True
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_768_cosine_map@100 |
+ |:-------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|
+ | 0.1695 | 10 | 3.9813 | - | - | - |
+ | 0.3390 | 20 | 2.6276 | - | - | - |
+ | 0.5085 | 30 | 1.7029 | - | - | - |
+ | 0.6780 | 40 | 0.641 | - | - | - |
+ | 0.8475 | 50 | 0.391 | - | - | - |
+ | **1.0** | **59** | **-** | **0.4761** | **0.4928** | **0.4919** |
+ | 0.1695 | 10 | 1.362 | - | - | - |
+ | 0.3390 | 20 | 0.7574 | - | - | - |
+ | 0.5085 | 30 | 0.5287 | - | - | - |
+ | 0.6780 | 40 | 0.096 | - | - | - |
+ | 0.8475 | 50 | 0.0699 | - | - | - |
+ | **1.0** | **59** | **-** | **0.4483** | **0.4913** | **0.4925** |
+ | 1.0169 | 60 | 0.25 | - | - | - |
+ | 1.1864 | 70 | 1.043 | - | - | - |
+ | 1.3559 | 80 | 0.8176 | - | - | - |
+ | 1.5254 | 90 | 0.6276 | - | - | - |
+ | 1.6949 | 100 | 0.0992 | - | - | - |
+ | 1.8644 | 110 | 0.0993 | - | - | - |
+ | 2.0 | 118 | - | 0.4469 | 0.4785 | 0.4862 |
+ | 0.1695 | 10 | 1.0617 | - | - | - |
+ | 0.3390 | 20 | 0.7721 | - | - | - |
+ | 0.5085 | 30 | 0.6991 | - | - | - |
+ | 0.6780 | 40 | 0.095 | - | - | - |
+ | 0.8475 | 50 | 0.0695 | - | - | - |
+ | **1.0** | **59** | **-** | **0.4519** | **0.4786** | **0.4748** |
+ | 1.0169 | 60 | 0.1892 | - | - | - |
+ | 1.1864 | 70 | 0.7125 | - | - | - |
+ | 1.3559 | 80 | 0.5113 | - | - | - |
+ | 1.5254 | 90 | 0.437 | - | - | - |
+ | 1.6949 | 100 | 0.0432 | - | - | - |
+ | 1.8644 | 110 | 0.0471 | - | - | - |
+ | 2.0 | 118 | - | 0.4347 | 0.4581 | 0.4516 |
+ | 0.1695 | 10 | 0.7237 | - | - | - |
+ | 0.3390 | 20 | 0.5054 | - | - | - |
+ | 0.5085 | 30 | 0.4194 | - | - | - |
+ | 0.6780 | 40 | 0.0437 | - | - | - |
+ | 0.8475 | 50 | 0.0388 | - | - | - |
+ | **1.0** | **59** | **-** | **0.4582** | **0.4692** | **0.4748** |
+ | 1.0169 | 60 | 0.1513 | - | - | - |
+ | 1.1864 | 70 | 0.5249 | - | - | - |
+ | 1.3559 | 80 | 0.3878 | - | - | - |
+ | 1.5254 | 90 | 0.3353 | - | - | - |
+ | 1.6949 | 100 | 0.0223 | - | - | - |
+ | 1.8644 | 110 | 0.0248 | - | - | - |
+ | 2.0 | 118 | - | 0.4251 | 0.4460 | 0.4439 |
+ | 2.0339 | 120 | 0.1012 | - | - | - |
+ | 2.2034 | 130 | 0.3534 | - | - | - |
+ | 2.3729 | 140 | 0.2937 | - | - | - |
+ | 2.5424 | 150 | 0.1769 | - | - | - |
+ | 2.7119 | 160 | 0.0107 | - | - | - |
+ | 2.8814 | 170 | 0.0102 | - | - | - |
+ | 3.0 | 177 | - | 0.4245 | 0.4448 | 0.4488 |
+ | 3.0508 | 180 | 0.1054 | - | - | - |
+ | 3.2203 | 190 | 0.2246 | - | - | - |
+ | 3.3898 | 200 | 0.2323 | - | - | - |
+ | 3.5593 | 210 | 0.1045 | - | - | - |
+ | 3.7288 | 220 | 0.0082 | - | - | - |
+ | 3.8983 | 230 | 0.0123 | - | - | - |
+ | 4.0 | 236 | - | 0.4198 | 0.4515 | 0.4438 |
+
+ * The bold row denotes the saved checkpoint.
+
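The Epoch column in the table above is consistent with `step / 59`, where 59 is the step count on the rows marked epoch 1.0. The counters also drop back to step 10 several times, which suggests the log concatenates more than one run; that reading is an inference from the table, not something the model card states. A small arithmetic check, with the hypothetical helper `epoch_at`:

```python
STEPS_PER_EPOCH = 59  # step count on the rows where Epoch is 1.0


def epoch_at(step, steps_per_epoch=STEPS_PER_EPOCH):
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch


for step in (10, 50, 59, 60, 118, 230, 236):
    print(step, round(epoch_at(step), 4))
```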
+ ### Framework Versions
+ - Python: 3.10.14
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.1.2+cu121
+ - Accelerate: 0.31.0
+ - Datasets: 2.19.1
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+ author = "Reimers, Nils and Gurevych, Iryna",
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+ month = "11",
+ year = "2019",
+ publisher = "Association for Computational Linguistics",
+ url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+ title={Matryoshka Representation Learning},
+ author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+ year={2024},
+ eprint={2205.13147},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+ year={2017},
+ eprint={1705.00652},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "BAAI/bge-base-en-v1.5",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.1.2+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92bebdc4b1492804bd24b7ab24676bb022c3a7f684a50e42cc7907a30154650e
+ size 437951328
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
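`modules.json` chains three stages: Transformer, Pooling, and Normalize. Together with `1_Pooling/config.json`, which sets `pooling_mode_cls_token` to true, this suggests the sentence embedding is the `[CLS]` token vector scaled to unit length. The following is only an illustrative sketch in plain Python of the last two stages; the function names are hypothetical, and the toy token embeddings stand in for the Transformer output (768-dimensional in the real model).

```python
import math


def cls_pool(token_embeddings):
    """Pooling stage (CLS mode): return the first token's vector."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Normalize stage: scale the vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def sentence_embedding_from(token_embeddings):
    """Apply the Pooling and Normalize stages in the order modules.json lists them."""
    return l2_normalize(cls_pool(token_embeddings))


# Toy Transformer output: 3 tokens, 4 dimensions each (assumption for brevity).
tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]
sentence_embedding = sentence_embedding_from(tokens)
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output is unit length, the dot product of two embeddings equals their cosine similarity.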
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
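The tokenizer configuration above maps `[PAD]`, `[CLS]`, and `[SEP]` to ids 0, 101, and 102 and caps inputs at 512 tokens. As a rough illustration of how those values shape a single-sentence input, the sketch below wraps a list of word-piece ids with `[CLS]` and `[SEP]`, truncates to the maximum length, and pads with `[PAD]`. The helper `build_inputs` and the example word-piece ids are hypothetical; the real `BertTokenizer` handles this internally.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids from "added_tokens_decoder"
MODEL_MAX_LENGTH = 512                # "model_max_length"


def build_inputs(token_ids, pad_to=None, max_length=MODEL_MAX_LENGTH):
    """Return (input_ids, attention_mask) for one sentence."""
    # Reserve two positions for [CLS] and [SEP] before truncating.
    body = list(token_ids)[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding  # padded positions are masked out
    return input_ids, attention_mask


# Hypothetical word-piece ids for a two-token sentence.
ids, mask = build_inputs([7592, 2088], pad_to=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```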
vocab.txt ADDED
The diff for this file is too large to render. See raw diff