pavanmantha committed
Commit
18bb063
1 Parent(s): 7f5f3ea

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
    "word_embedding_dimension": 768,
    "pooling_mode_cls_token": true,
    "pooling_mode_mean_tokens": false,
    "pooling_mode_max_tokens": false,
    "pooling_mode_mean_sqrt_len_tokens": false,
    "pooling_mode_weightedmean_tokens": false,
    "pooling_mode_lasttoken": false,
    "include_prompt": true
}
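The config above enables CLS-token pooling only (all other modes are disabled). As a minimal sketch of what this pooling step does — using plain NumPy as a stand-in for the model's token embeddings, which is an assumption for illustration, not the library's implementation:

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """CLS pooling as configured: take the first token's embedding,
    then L2-normalize it (the model architecture adds a Normalize layer)."""
    cls_vec = token_embeddings[0]  # "pooling_mode_cls_token": true
    return cls_vec / np.linalg.norm(cls_vec)

# Toy example: 5 tokens, 768-dim embeddings (random stand-in values)
tokens = np.random.default_rng(0).normal(size=(5, 768))
sentence_vec = cls_pool_and_normalize(tokens)
print(sentence_vec.shape)  # (768,)
```

The other booleans (mean, max, last-token, etc.) select alternative reductions over the token axis; only one is active here.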
README.md ADDED
@@ -0,0 +1,773 @@
---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:4247
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
datasets: []
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
widget:
- source_sentence: The Opa1 protein localizes to the mitochondria.Opa1 is found normally in the mitochondrial intermembrane space.
  sentences:
  - Which is the cellular localization of the protein Opa1?
  - Which are the genes responsible for Dyskeratosis Congenita?
  - List blood marker for Non-Hodgkin lymphoma.
- source_sentence: CorrSite identifies potential allosteric ligand-binding sites based on motion correlation analyses between cavities.We find that CARDS captures allosteric communication between the two cAMP-Binding Domains (CBDs)Overall, it is demonstrated that the communication pathways could be multiple and intrinsically disposed, and the MC path generation approach provides an effective tool for the prediction of key residues that mediate the allosteric communication in an ensemble of pathways and functionally plausible residuesWe utilized a data set of 24 known allosteric sites from 23 monomer proteins to calculate the correlations between potential ligand-binding sites and corresponding orthosteric sites using a Gaussian network model (GNM)Here, we introduce the Correlation of All Rotameric and Dynamical States (CARDS) framework for quantifying correlations between both the structure and disorder of different regions of a proteinWe present a novel method, "MutInf", to identify statistically significant correlated motions from equilibrium molecular dynamics simulationsCorrSite identifies potential allosteric ligand-binding sites based on motion correlation analyses between cavities.Here, a Monte Carlo (MC) path generation approach is proposed and implemented to define likely allosteric pathways through generating an ensemble of maximum probability paths.Here, a Monte Carlo (MC) path generation approach is proposed and implemented to define likely allosteric pathways through generating an ensemble of maximum probability paths. Overall, it is demonstrated that the communication pathways could be multiple and intrinsically disposed, and the MC path generation approach provides an effective tool for the prediction of key residues that mediate the allosteric communication in an ensemble of pathways and functionally plausible residues We utilized a data set of 24 known allosteric sites from 23 monomer proteins to calculate the correlations between potential ligand-binding sites and corresponding orthosteric sites using a Gaussian network model (GNM)A Monte Carlo (MC) path generation approach is proposed and implemented to define likely allosteric pathways through generating an ensemble of maximum probability paths. A novel method, "MutInf", to identify statistically significant correlated motions from equilibrium molecular dynamics simulations. CorrSite identifies potential alloster-binding sites based on motion correlation analyses between cavities. The Correlation of All Rotameric and Dynamical States (CARDS) framework for quantifying correlations between both the structure and disorder of different regions of a proteinComputational tools for predicting allosteric pathways in proteins include MCPath, MutInf, pySCA, CorrSite, and CARDS.
  sentences:
  - Computational tools for predicting allosteric pathways in proteins
  - What is PANTHER-PSEP?
  - What illness is transmitted by the Lone Star Tick, Amblyomma americanum?
- source_sentence: "Dopaminergic drugs should be given in patients with BMS. \nCatuama reduces the symptoms of BMS and may be a novel therapeutic strategy for the treatment of this disease.\nCapsaicin, alpha-lipoic acid (ALA), and clonazepam were those that showed more reduction in symptoms of BMS.\nTreatment with placebos produced a response that was 72% as large as the response to active drugs"
  sentences:
  - What is the cyberknife used for?
  - Which compounds exist that are thyroid hormone analogs?
  - Which are the drugs utilized for the burning mouth syndrome?
- source_sentence: Tinea is a superficial fungal infections of the skin.
  sentences:
  - Which molecule is targeted by a monoclonal antibody Mepolizumab?
  - What disease is tinea ?
  - Which algorithm is used for detection of long repeat expansions?
- source_sentence: Basset is an open source package which applies CNNs to learn the functional activity of DNA sequences from genomics data. Basset was trained on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq, and demonstrated greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for Genome-wide association study (GWAS) SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
  sentences:
  - Givosiran is used for treatment of which disease?
  - Describe the applicability of Basset in the context of deep learning
  - What is the causative agent of the "Panama disease" affecting bananas?
pipeline_tag: sentence-similarity
model-index:
- name: BGE base BioASQ Matryoshka
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: cosine_accuracy@1
      value: 0.8432203389830508
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.9427966101694916
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.961864406779661
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9788135593220338
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8432203389830508
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3142655367231638
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19237288135593222
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.0978813559322034
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8432203389830508
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.9427966101694916
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.961864406779661
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9788135593220338
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9167805960832026
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8963327280064567
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8971987609787653
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 512
      type: dim_512
    metrics:
    - type: cosine_accuracy@1
      value: 0.8538135593220338
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.9427966101694916
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.961864406779661
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9745762711864406
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8538135593220338
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3142655367231638
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19237288135593222
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09745762711864407
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8538135593220338
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.9427966101694916
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.961864406779661
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9745762711864406
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9198462326957965
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.9016772598870054
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.9026755533837086
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: cosine_accuracy@1
      value: 0.8453389830508474
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.9385593220338984
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.9555084745762712
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9745762711864406
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8453389830508474
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3128531073446327
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19110169491525425
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09745762711864407
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8453389830508474
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.9385593220338984
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.9555084745762712
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9745762711864406
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.914207272128957
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8944528517621736
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8952712251263324
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: dim 128
      type: dim_128
    metrics:
    - type: cosine_accuracy@1
      value: 0.8220338983050848
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.9279661016949152
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.9449152542372882
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9703389830508474
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8220338983050848
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.3093220338983051
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.18898305084745767
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09703389830508474
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8220338983050848
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.9279661016949152
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.9449152542372882
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9703389830508474
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.901534580728345
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8789800242130752
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8801051507894794
      name: Cosine Map@100
---

# BGE base BioASQ Matryoshka

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("pavanmantha/bge-base-en-bioembed768")
# Run inference
sentences = [
    "Basset is an open source package which applies CNNs to learn the functional activity of DNA sequences from genomics data. Basset was trained on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq, and demonstrated greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for Genome-wide association study (GWAS) SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.",
    'Describe the applicability of Basset in the context of deep learning',
    'What is the causative agent of the "Panama disease" affecting bananas?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
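Because the model was trained with MatryoshkaLoss at dimensions 768/512/256/128, its embeddings can be truncated to a shorter prefix and re-normalized with only a modest quality drop. A minimal NumPy sketch of that truncation step, using random unit vectors as a stand-in for `model.encode(...)` output (the stand-in is an assumption for illustration):

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and re-apply
    L2 normalization, as Matryoshka-style retrieval expects."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for model output: 3 unit-normalized 768-dim vectors
rng = np.random.default_rng(42)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
print(small.shape)  # (3, 256)
# Cosine similarity on re-normalized vectors is a plain dot product
sims = small @ small.T
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which should achieve the same effect without manual slicing; check the installed version's documentation before relying on it.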

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Information Retrieval
* Dataset: `dim_768`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8432     |
| cosine_accuracy@3   | 0.9428     |
| cosine_accuracy@5   | 0.9619     |
| cosine_accuracy@10  | 0.9788     |
| cosine_precision@1  | 0.8432     |
| cosine_precision@3  | 0.3143     |
| cosine_precision@5  | 0.1924     |
| cosine_precision@10 | 0.0979     |
| cosine_recall@1     | 0.8432     |
| cosine_recall@3     | 0.9428     |
| cosine_recall@5     | 0.9619     |
| cosine_recall@10    | 0.9788     |
| cosine_ndcg@10      | 0.9168     |
| cosine_mrr@10       | 0.8963     |
| **cosine_map@100**  | **0.8972** |

#### Information Retrieval
* Dataset: `dim_512`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8538     |
| cosine_accuracy@3   | 0.9428     |
| cosine_accuracy@5   | 0.9619     |
| cosine_accuracy@10  | 0.9746     |
| cosine_precision@1  | 0.8538     |
| cosine_precision@3  | 0.3143     |
| cosine_precision@5  | 0.1924     |
| cosine_precision@10 | 0.0975     |
| cosine_recall@1     | 0.8538     |
| cosine_recall@3     | 0.9428     |
| cosine_recall@5     | 0.9619     |
| cosine_recall@10    | 0.9746     |
| cosine_ndcg@10      | 0.9198     |
| cosine_mrr@10       | 0.9017     |
| **cosine_map@100**  | **0.9027** |

#### Information Retrieval
* Dataset: `dim_256`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8453     |
| cosine_accuracy@3   | 0.9386     |
| cosine_accuracy@5   | 0.9555     |
| cosine_accuracy@10  | 0.9746     |
| cosine_precision@1  | 0.8453     |
| cosine_precision@3  | 0.3129     |
| cosine_precision@5  | 0.1911     |
| cosine_precision@10 | 0.0975     |
| cosine_recall@1     | 0.8453     |
| cosine_recall@3     | 0.9386     |
| cosine_recall@5     | 0.9555     |
| cosine_recall@10    | 0.9746     |
| cosine_ndcg@10      | 0.9142     |
| cosine_mrr@10       | 0.8945     |
| **cosine_map@100**  | **0.8953** |

#### Information Retrieval
* Dataset: `dim_128`
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.822      |
| cosine_accuracy@3   | 0.928      |
| cosine_accuracy@5   | 0.9449     |
| cosine_accuracy@10  | 0.9703     |
| cosine_precision@1  | 0.822      |
| cosine_precision@3  | 0.3093     |
| cosine_precision@5  | 0.189      |
| cosine_precision@10 | 0.097      |
| cosine_recall@1     | 0.822      |
| cosine_recall@3     | 0.928      |
| cosine_recall@5     | 0.9449     |
| cosine_recall@10    | 0.9703     |
| cosine_ndcg@10      | 0.9015     |
| cosine_mrr@10       | 0.879      |
| **cosine_map@100**  | **0.8801** |
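The accuracy@k and MRR@10 figures above follow the usual single-relevant-document definitions used by `InformationRetrievalEvaluator`. A minimal sketch of those two metrics over toy rankings (the document IDs below are hypothetical, not the actual BioASQ data):

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant document within the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy example: two queries, each with one relevant document
runs = [(["d3", "d1", "d7"], "d1"),  # relevant at rank 2
        (["d5", "d2", "d9"], "d9")]  # relevant at rank 3
acc1 = sum(accuracy_at_k(r, rel, 1) for r, rel in runs) / len(runs)
mrr = sum(mrr_at_k(r, rel) for r, rel in runs) / len(runs)
print(acc1, mrr)  # 0.0 0.4166666666666667
```

With one relevant document per query, recall@k equals accuracy@k and precision@k is accuracy@k divided by k, which matches the pattern in the tables above.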

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 4,247 training samples
* Columns: <code>positive</code> and <code>anchor</code>
* Approximate statistics based on the first 1000 samples:
  |         | positive                                                                            | anchor                                                                            |
  |:--------|:------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
  | type    | string                                                                              | string                                                                            |
  | details | <ul><li>min: 3 tokens</li><li>mean: 102.44 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 15.78 tokens</li><li>max: 44 tokens</li></ul> |
* Samples:
  | positive | anchor |
  |:---------|:-------|
  | <code>Restless legs syndrome (RLS), also known as Willis-Ekbom disease (WED), is a common movement disorder characterized by an uncontrollable urge to move because of uncomfortable, sometimes painful sensations in the legs with a diurnal variation and a release with movement.</code> | <code>Willis-Ekbom disease is also known as?</code> |
  | <code>Report the outcomes of laser in situ keratomileusis (LASIK) for high myopia correction after long-term follow-up['Report the outcomes of laser in situ keratomileusis (LASIK) for high myopia correction after long-term follow-up.']Laser in situ keratomileusis is also known as LASIKLaser in situ keratomileusis (LASIK)</code> | <code>What is another name for keratomileusis?</code> |
  | <code>CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them.CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them. CellMaps can easily be integrated in any web page by using an available JavaScript API. Computations and analyses are remotely executed in high-end servers, and all the functionalities are available through RESTful web services. CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them. Computations and analyses are remotely executed in high-end servers, and all the functionalities are available through RESTful web services. CellMaps can easily be integrated in any web page by using an available JavaScript API. CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them. Computations and analyses are remotely executed in high-end servers, and all the functionalities are available through RESTful web services. CellMaps can easily be integrated in any web page by using an available JavaScript API.CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them. CellMaps is an HTML5 open-source web tool that allows displaying, editing, exploring and analyzing biological networks as well as integrating metadata into them. CellMaps can easily be integrated in any web page by using an available JavaScript API. Computations and analyses are remotely executed in high-end servers, and all the functionalities are available through RESTful web services.</code> | <code>What is CellMaps?</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
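MatryoshkaLoss applies the inner loss at each truncated dimension and combines the results using the (here equal) weights above. A self-contained NumPy sketch of that mechanic, using a simplified in-batch multiple-negatives ranking loss — illustrative only, not the library's implementation:

```python
import numpy as np

def mnrl(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Simplified in-batch MultipleNegativesRankingLoss: cross-entropy over
    scaled cosine similarities, where row i's positive is column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) similarity matrix
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # NLL of the matching pairs

def matryoshka_mnrl(anchors, positives, dims=(768, 512, 256, 128), weights=(1, 1, 1, 1)):
    """Sum the inner loss over truncated embedding prefixes, as MatryoshkaLoss does."""
    return sum(w * mnrl(anchors[:, :d], positives[:, :d]) for d, w in zip(dims, weights))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 768))
positives = anchors + 0.1 * rng.normal(size=(4, 768))  # noisy matches
loss = matryoshka_mnrl(anchors, positives)
print(loss > 0)  # True
```

Because every prefix is trained against the same objective, the leading components of each embedding carry most of the signal, which is what makes post-hoc truncation viable.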
552
+
553
+ ### Training Hyperparameters
554
+ #### Non-Default Hyperparameters
555
+
556
+ - `eval_strategy`: epoch
557
+ - `per_device_train_batch_size`: 32
558
+ - `per_device_eval_batch_size`: 16
559
+ - `gradient_accumulation_steps`: 16
560
+ - `learning_rate`: 2e-05
561
+ - `num_train_epochs`: 10
562
+ - `lr_scheduler_type`: cosine
563
+ - `warmup_ratio`: 0.1
564
+ - `fp16`: True
565
+ - `tf32`: False
566
+ - `load_best_model_at_end`: True
567
+ - `optim`: adamw_torch_fused
568
+ - `batch_sampler`: no_duplicates
569
+
570
+ #### All Hyperparameters
571
+ <details><summary>Click to expand</summary>
572
+
573
+ - `overwrite_output_dir`: False
574
+ - `do_predict`: False
575
+ - `eval_strategy`: epoch
576
+ - `prediction_loss_only`: True
577
+ - `per_device_train_batch_size`: 32
578
+ - `per_device_eval_batch_size`: 16
579
+ - `per_gpu_train_batch_size`: None
580
+ - `per_gpu_eval_batch_size`: None
581
+ - `gradient_accumulation_steps`: 16
582
+ - `eval_accumulation_steps`: None
583
+ - `learning_rate`: 2e-05
584
+ - `weight_decay`: 0.0
585
+ - `adam_beta1`: 0.9
586
+ - `adam_beta2`: 0.999
587
+ - `adam_epsilon`: 1e-08
588
+ - `max_grad_norm`: 1.0
589
+ - `num_train_epochs`: 10
590
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: False
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+ 
+ </details>
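
The learning-rate behaviour implied by `lr_scheduler_type: cosine` with `warmup_ratio: 0.1` can be sketched as a stand-alone function. This is a simplified reimplementation of the schedule's shape, not the trainer's own code; the 80-step total is illustrative (it matches the last logged step below, but the real trainer derives it from the dataloader):

```python
import math

def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Linear warmup over the first warmup_ratio of training, then cosine decay to 0.
    Mirrors the shape of a cosine-with-warmup schedule."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)              # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay: 1 -> 0

peak = lr_multiplier(8, 80)   # end of warmup: multiplier reaches 1.0
end = lr_multiplier(80, 80)   # end of training: multiplier decays to 0.0
```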
+ 
+ ### Training Logs
+ | Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_768_cosine_map@100 |
+ |:----------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|
+ | 0.9624 | 8 | - | 0.8560 | 0.8821 | 0.8904 | 0.8876 |
+ | 1.2030 | 10 | 1.2833 | - | - | - | - |
+ | 1.9248 | 16 | - | 0.8655 | 0.8808 | 0.8909 | 0.8889 |
+ | 2.4060 | 20 | 0.4785 | - | - | - | - |
+ | 2.8872 | 24 | - | 0.8720 | 0.8875 | 0.8893 | 0.8921 |
+ | 3.6090 | 30 | 0.2417 | - | - | - | - |
+ | 3.9699 | 33 | - | 0.8751 | 0.8924 | 0.8955 | 0.8960 |
+ | 4.8120 | 40 | 0.1607 | - | - | - | - |
+ | 4.9323 | 41 | - | 0.8799 | 0.8932 | 0.8964 | 0.8952 |
+ | 5.8947 | 49 | - | 0.8785 | 0.8944 | 0.9009 | 0.8982 |
+ | 6.0150 | 50 | 0.1152 | - | - | - | - |
+ | **6.9774** | **58** | **-** | **0.8803** | **0.8947** | **0.9018** | **0.8975** |
+ | 7.2180 | 60 | 0.0924 | - | - | - | - |
+ | 7.9398 | 66 | - | 0.8802 | 0.8956 | 0.9016 | 0.8973 |
+ | 8.4211 | 70 | 0.0832 | - | - | - | - |
+ | 8.9023 | 74 | - | 0.8801 | 0.8956 | 0.9027 | 0.8972 |
+ | 9.6241 | 80 | 0.074 | 0.8801 | 0.8953 | 0.9027 | 0.8972 |
+ 
+ * The bold row denotes the saved checkpoint.
+ 
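
Because the model was trained with MatryoshkaLoss at dimensions 768/512/256/128 (the `dim_*` columns above), an embedding can be truncated to a prefix and re-normalized at inference time, trading a small amount of retrieval quality for storage. A minimal sketch of that post-processing step in plain Python, applied to a dummy vector rather than a real model output:

```python
import math

def truncate_and_normalize(embedding: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components of a Matryoshka embedding and L2-normalize,
    so cosine similarity at the smaller dimension reduces to a dot product."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]

full = [0.5, -0.5, 0.5, 0.5]             # stand-in for a full 768-d embedding
small = truncate_and_normalize(full, 2)  # e.g. 768 -> 128 in the real setting
```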
+ ### Framework Versions
+ - Python: 3.10.13
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.1.2
+ - Accelerate: 0.31.0
+ - Datasets: 2.19.2
+ - Tokenizers: 0.19.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+ 
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "BAAI/bge-base-en-v1.5",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
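
The `config.json` above pins a standard BERT-base architecture, so its parameter count can be recomputed from those fields and cross-checked against the float32 `model.safetensors` payload below. A back-of-the-envelope sketch, assuming the usual `BertModel` tensor layout (including the pooler head):

```python
hidden, inter, layers = 768, 3072, 12          # hidden_size, intermediate_size, num_hidden_layers
vocab, max_pos, type_vocab = 30522, 512, 2     # vocab_size, max_position_embeddings, type_vocab_size

embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden  # word/pos/type tables + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V and attention output projections (+ biases)
    + 2 * hidden                     # attention LayerNorm
    + hidden * inter + inter         # FFN up-projection
    + inter * hidden + hidden        # FFN down-projection
    + 2 * hidden                     # output LayerNorm
)
pooler = hidden * hidden + hidden
total = embeddings + layers * per_layer + pooler
# total == 109_482_240 parameters; at 4 bytes each that is ~437.9 MB, consistent
# with the 437,951,328-byte safetensors file (the small remainder is the file header).
```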
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.0.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.1.2"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65689a5dde7c417efb8fe6c722a2174ed823204c4858c36405d32b04dcf685d4
+ size 437951328
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
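
`modules.json` declares a three-stage pipeline: Transformer → Pooling → Normalize. Combined with the `1_Pooling/config.json` setting `pooling_mode_cls_token: true`, the post-transformer steps amount to taking the [CLS] token's hidden state and L2-normalizing it. A sketch of those last two stages on dummy hidden states (not real model outputs):

```python
import math

def cls_pool_and_normalize(token_embeddings: list[list[float]]) -> list[float]:
    """Pooling (CLS token, per 1_Pooling/config.json) followed by 2_Normalize."""
    cls = token_embeddings[0]                     # CLS is the first token's hidden state
    norm = math.sqrt(sum(x * x for x in cls)) or 1.0
    return [x / norm for x in cls]                # scale to unit L2 norm

# Dummy hidden states for a 3-token sequence with hidden size 4:
hidden_states = [[3.0, 4.0, 0.0, 0.0], [1.0] * 4, [2.0] * 4]
sentence_embedding = cls_pool_and_normalize(hidden_states)  # [0.6, 0.8, 0.0, 0.0]
```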
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": true
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
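
The tokenizer is a standard lowercasing `BertTokenizer`, and the special-token IDs in `added_tokens_decoder` fix how raw input is framed before it reaches the model. A hypothetical sketch of that framing step; the real tokenizer produces the inner WordPiece IDs from `vocab.txt`, so the example IDs here are placeholders:

```python
SPECIAL = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}

def frame_and_pad(token_ids: list[int], max_len: int = 512) -> list[int]:
    """Wrap WordPiece IDs as [CLS] ... [SEP], truncate to model_max_length, pad with [PAD]."""
    ids = [SPECIAL["[CLS]"]] + token_ids[: max_len - 2] + [SPECIAL["[SEP]"]]
    return ids + [SPECIAL["[PAD]"]] * (max_len - len(ids))

framed = frame_and_pad([2023, 2003], max_len=8)
# -> [101, 2023, 2003, 102, 0, 0, 0, 0]
```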
vocab.txt ADDED
The diff for this file is too large to render. See raw diff