Ghani-25 committed
Commit 7cd4320 · verified · 1 Parent(s): 5ede865

Update README.md

Files changed (1):
  1. README.md +11 -152
README.md CHANGED
@@ -9,14 +9,15 @@ tags:
  - generated_from_trainer
  - dataset_size:31500
  - loss:MatryoshkaLoss
- - loss:CosineSimilarityLoss
  base_model: Ghani-25/LF_enrich_sim
  widget:
  - source_sentence: CTO and co-Founder
    sentences:
    - Responsable surpervision des départements
    - Senior sales executive
-   - Injection Operations Supervisor - Industrial Efficiency - Systems & Equipment
+   - >-
+     Injection Operations Supervisor - Industrial Efficiency - Systems &
+     Equipment
  - source_sentence: Commercial Account Executive
    sentences:
    - Automation Electrician
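The widget change above is purely cosmetic: the long job title is rewrapped with YAML's `>-` folded block scalar, which joins the indented lines with single spaces and strips the trailing newline, so the parsed string is identical. A minimal sketch of this behavior (assuming PyYAML is installed; it is not part of this repo):

```python
import yaml  # PyYAML, assumed available for illustration only

snippet = """
sentences:
- >-
  Injection Operations Supervisor - Industrial Efficiency - Systems &
  Equipment
"""
# The folded scalar parses back to the original single-line string.
print(yaml.safe_load(snippet)["sentences"][0])
# Injection Operations Supervisor - Industrial Efficiency - Systems & Equipment
```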
@@ -114,7 +115,7 @@ model-index:
 
 # Our original base similarity Matryoshka
 
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ This is a sentence-transformers model finetuned from [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
 ## Model Details
 
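As a hedged illustration of the semantic-search use case the description mentions (the titles are borrowed from the widget examples above; `model.similarity` is available in recent sentence-transformers releases):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")

# Score one job title against a few candidates from the widget examples
query = model.encode(["Commercial Account Executive"])
candidates = model.encode([
    "Senior sales executive",
    "Automation Electrician",
    "Clinical Project Leader",
])

scores = model.similarity(query, candidates)  # cosine similarity by default
print(scores.shape)  # [1, 3]; the largest entry marks the closest title
```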
@@ -129,12 +130,6 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [G
  - **Language:** multilingual
  - **License:** apache-2.0
 
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
 ### Full Model Architecture
 
 ```
@@ -163,7 +158,7 @@ model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
 # Run inference
 sentences = [
     'Summer Job: Export Manager',
     'Responsable Export Afrique Amériques',
     'Clinical Project Leader',
 ]
 embeddings = model.encode(sentences)
@@ -174,6 +169,11 @@ print(embeddings.shape)
 similarities = model.similarity(embeddings, embeddings)
 print(similarities.shape)
 # [3, 3]
+
+ # Extract the diagonal to obtain the matching similarity scores
+ similarities_diagonal = similarities.diag().cpu().numpy()
+ print(similarities_diagonal)
+ # [0.896542]
 ```
 
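Because the model is trained with MatryoshkaLoss, the leading dimensions of each embedding remain useful on their own. A minimal sketch of encoding at a reduced size (the `truncate_dim` argument exists in recent sentence-transformers releases; it is not shown in this README):

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 64 dimensions of every embedding at encode time
model_64 = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64", truncate_dim=64)

embeddings = model_64.encode(["CTO and co-Founder", "Senior sales executive"])
print(embeddings.shape)
# (2, 64) instead of (2, 768), at a small cost in accuracy
```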
  <!--
@@ -285,119 +285,7 @@ You can finetune this model on your own dataset.
  - `optim`: adamw_torch_fused
 
 #### All Hyperparameters
- <details><summary>Click to expand</summary>
-
- - `overwrite_output_dir`: False
- - `do_predict`: False
- - `eval_strategy`: epoch
- - `prediction_loss_only`: True
- - `per_device_train_batch_size`: 32
- - `per_device_eval_batch_size`: 16
- - `per_gpu_train_batch_size`: None
- - `per_gpu_eval_batch_size`: None
- - `gradient_accumulation_steps`: 16
- - `eval_accumulation_steps`: None
- - `learning_rate`: 2e-05
- - `weight_decay`: 0.0
- - `adam_beta1`: 0.9
- - `adam_beta2`: 0.999
- - `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1.0
- - `num_train_epochs`: 4
- - `max_steps`: -1
- - `lr_scheduler_type`: cosine
- - `lr_scheduler_kwargs`: {}
- - `warmup_ratio`: 0.1
- - `warmup_steps`: 0
- - `log_level`: passive
- - `log_level_replica`: warning
- - `log_on_each_node`: True
- - `logging_nan_inf_filter`: True
- - `save_safetensors`: True
- - `save_on_each_node`: False
- - `save_only_model`: False
- - `restore_callback_states_from_checkpoint`: False
- - `no_cuda`: False
- - `use_cpu`: False
- - `use_mps_device`: False
- - `seed`: 42
- - `data_seed`: None
- - `jit_mode_eval`: False
- - `use_ipex`: False
- - `bf16`: True
- - `fp16`: False
- - `fp16_opt_level`: O1
- - `half_precision_backend`: auto
- - `bf16_full_eval`: False
- - `fp16_full_eval`: False
- - `tf32`: True
- - `local_rank`: 0
- - `ddp_backend`: None
- - `tpu_num_cores`: None
- - `tpu_metrics_debug`: False
- - `debug`: []
- - `dataloader_drop_last`: False
- - `dataloader_num_workers`: 0
- - `dataloader_prefetch_factor`: None
- - `past_index`: -1
- - `disable_tqdm`: False
- - `remove_unused_columns`: True
- - `label_names`: None
- - `load_best_model_at_end`: True
- - `ignore_data_skip`: False
- - `fsdp`: []
- - `fsdp_min_num_params`: 0
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- - `fsdp_transformer_layer_cls_to_wrap`: None
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- - `deepspeed`: None
- - `label_smoothing_factor`: 0.0
- - `optim`: adamw_torch_fused
- - `optim_args`: None
- - `adafactor`: False
- - `group_by_length`: False
- - `length_column_name`: length
- - `ddp_find_unused_parameters`: None
- - `ddp_bucket_cap_mb`: None
- - `ddp_broadcast_buffers`: False
- - `dataloader_pin_memory`: True
- - `dataloader_persistent_workers`: False
- - `skip_memory_metrics`: True
- - `use_legacy_prediction_loop`: False
- - `push_to_hub`: False
- - `resume_from_checkpoint`: None
- - `hub_model_id`: None
- - `hub_strategy`: every_save
- - `hub_private_repo`: False
- - `hub_always_push`: False
- - `gradient_checkpointing`: False
- - `gradient_checkpointing_kwargs`: None
- - `include_inputs_for_metrics`: False
- - `eval_do_concat_batches`: True
- - `fp16_backend`: auto
- - `push_to_hub_model_id`: None
- - `push_to_hub_organization`: None
- - `mp_parameters`:
- - `auto_find_batch_size`: False
- - `full_determinism`: False
- - `torchdynamo`: None
- - `ray_scope`: last
- - `ddp_timeout`: 1800
- - `torch_compile`: False
- - `torch_compile_backend`: None
- - `torch_compile_mode`: None
- - `dispatch_batches`: None
- - `split_batches`: None
- - `include_tokens_per_second`: False
- - `include_num_input_tokens_seen`: False
- - `neftune_noise_alpha`: None
- - `optim_target_modules`: None
- - `batch_eval_metrics`: False
- - `prompts`: None
- - `batch_sampler`: batch_sampler
- - `multi_dataset_batch_sampler`: proportional
-
- </details>
+ Contact the author.
 
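For readers reconstructing the training setup from the tags (`loss:MatryoshkaLoss`, plus the `loss:CosineSimilarityLoss` tag this commit removes), a hedged sketch of how such a loss is typically assembled with the sentence-transformers API; the dimension list is an assumption taken from the evaluation columns in the training logs below:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss, MatryoshkaLoss

model = SentenceTransformer("Ghani-25/LF_enrich_sim")

# Apply the base pairwise loss at every truncation size, so each prefix
# of the embedding is trained to be useful on its own.
base_loss = CosineSimilarityLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```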
### Training Logs
| Epoch | Step | Training Loss | dim_768_spearman_cosine | dim_512_spearman_cosine | dim_256_spearman_cosine | dim_128_spearman_cosine | dim_64_spearman_cosine |
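Each `dim_*_spearman_cosine` column is the Spearman rank correlation between the model's cosine similarities at that truncation size and the gold similarity labels. A toy illustration with made-up numbers:

```python
import numpy as np
from scipy.stats import spearmanr

gold = np.array([0.90, 0.10, 0.55, 0.70])  # hypothetical labeled similarities
pred = np.array([0.82, 0.25, 0.48, 0.75])  # hypothetical model cosine scores

rho, _ = spearmanr(gold, pred)
print(rho)  # 1.0 here: the two rankings agree exactly
```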
@@ -442,35 +330,6 @@ You can finetune this model on your own dataset.
  - Datasets: 2.19.1
  - Tokenizers: 0.19.1
 
- ## Citation
-
- ### BibTeX
-
- #### Sentence Transformers
- ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
-     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-     author = "Reimers, Nils and Gurevych, Iryna",
-     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-     month = "11",
-     year = "2019",
-     publisher = "Association for Computational Linguistics",
-     url = "https://arxiv.org/abs/1908.10084",
- }
- ```
-
- #### MatryoshkaLoss
- ```bibtex
- @misc{kusupati2024matryoshka,
-     title={Matryoshka Representation Learning},
-     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
-     year={2024},
-     eprint={2205.13147},
-     archivePrefix={arXiv},
-     primaryClass={cs.LG}
- }
- ```
-
 <!--
 ## Glossary
 
 