How to use MinhPhuc0804/me5-docling-checkthat-task1-v1 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MinhPhuc0804/me5-docling-checkthat-task1-v1")
sentences = [
"query: Our work is not trauma-informed unless it is rooted in anti-racist, anti-oppressive, culturally sustaining, liberating practices. Stolen Breaths | NEJM",
"passage: title: Long COVID symptoms and duration in SARS-CoV-2 positive children — a nationwide cohort study\nabstract: Most children have a mild course of acute COVID-19. Only few mainly non-controlled studies with small sample size have evaluated long-term recovery from SARS-CoV-2 infection in children. The aim of this study was to evaluate symptoms and duration of 'long COVID' in children. A nationwide cohort study of 37,522 children aged 0-17 years with RT-PCR verified SARS-CoV-2 infection (response rate 44.9%) and a control group of 78,037 children (response rate 21.3%). An electronic questionnaire was sent to all children from March 24th until May 9th, 2021. Symptoms lasting > 4 weeks were common among both SARS-CoV-2 children and controls. However, SARS-CoV-2 children aged 6-17 years reported symptoms more frequently than the control group (percent difference 0.8%).",
"passage: ke-specific antibodies to mediate antibody-dependent cellular phagocytosis and complement deposition.\n\ntitle: Class switch toward noninflammatory, spike-specific IgG4 antibodies after repeated SARS-CoV-2 mRNA vaccination\nBecause Fc-mediated effector functions are critical for antiviral immunity, these findings may have consequences for the choice and timing of vaccination regimens using mRNA vaccines, including future booster immunizations against SARS-CoV-2.",
"passage: title: Stolen Breaths\nabstract: In the wake of George Floyd’s public execution, uprisings have ignited in cities throughout the United States The words “I can’t breathe” hang heavy in the air Black people cannot breathe because we are currently battling at least two public health emergencies"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
This is a sentence-transformers model finetuned from intfloat/multilingual-e5-large-instruct. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for retrieval.
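Retrieval with this kind of model comes down to sorting passages by their similarity to the query. When embeddings are unit-length, cosine similarity equals a plain dot product. The sketch below uses made-up 3-dimensional vectors in place of the 1024-dimensional embeddings; `normalize` and `top_k` are hypothetical helpers, not part of sentence-transformers.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, passage_vecs, k=2):
    """Return (passage_index, score) pairs sorted by descending dot product."""
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vectors for illustration only, NOT real model output.
query = normalize([1.0, 0.0, 0.0])
passages = [normalize(v) for v in ([0.0, 1.0, 0.0], [3.0, 4.0, 0.0], [1.0, 0.0, 0.0])]

results = top_k(query, passages, k=2)
print(results)  # passage 2 ranks first, then passage 1
```

In practice the vectors would come from `model.encode(...)`, with the first row as the query and the remaining rows as candidate passages.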
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
  (2): Normalize({})
)
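The module list above reads as a pipeline: token embeddings from the transformer, mean pooling into one vector, then L2 normalization. A minimal pure-Python sketch of the last two steps follows, using toy 2-dimensional token vectors. Skipping padded positions via an attention mask is the usual convention for mean pooling and is assumed here rather than read from the card; the function names are hypothetical.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]

def l2_normalize(vec):
    """Scale to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# Three token positions; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)            # average of the first two tokens
sentence_embedding = l2_normalize(pooled)   # unit-length sentence vector
print(pooled, sentence_embedding)
```

With `include_prompt: True`, tokens belonging to the `query:` or `passage:` prefix take part in the average as well.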
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("MinhPhuc0804/me5-docling-checkthat-task1-v1")
# Run inference
sentences = [
"query: @user You're more prone to get COVID-19 with more of the jabs. They're aiming to cut down Earth's population.",
'passage: doses. Results Among 51017 employees, COVID-19 occurred in 4424 (8.7%) during the study.\n\ntitle: Effectiveness of the Coronavirus Disease 2019 (COVID-19) Bivalent Vaccine\nIn multivariable analysis, the bivalent vaccinated state was associated with lower risk of COVID-19 during the BA.4/5 dominant (HR, .71; 95% C.I., .63-.79) and the BQ dominant (HR, .80; 95% C.I., .69-.94) phases, but decreased risk was not found during the XBB dominant phase (HR, .96; 95% C.I., .82-.1.12). Estimated vaccine effectiveness (VE) was 29% (95% C.I., 21%-37%), 20% (95% C.I., 6%-31%), and 4% (95% C.I., -12%-18%), during the BA.4/5, BQ, and XBB dominant phases, respectively. Risk of COVID-19 also increased with time since most recent prior COVID-19 episode and with the number of vaccine doses previously received.',
'passage: d the bacterial counts from the same surgeon, a significant increase was noted in the 2-hours group.\n\ntitle: Surgical masks as source of bacterial contamination during operative procedures\nMoreover, the bacterial counts were significantly higher among the surgeons than the OR. Additionally, the bacterial count of the external surface of the second mask was significantly higher than that of the first one. The source of bacterial contamination in SMs was the body surface of the surgeons rather than the OR environment. Moreover, we recommend that surgeons should change the mask after each operation, especially those beyond 2 hours. Double-layered SMs or those with excellent filtration function may also be a better alternative. This study provides strong evidence for the identification that SMs as source of bacterial contamination during operative procedures, which should be a cause for alarm and attention in the prevention of surgical site infection in clinical practice.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6161, 0.0673],
# [0.6161, 1.0000, 0.0795],
# [0.0673, 0.0795, 1.0000]])
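The example inputs carry a `query: ` or `passage: ` prefix, and full passages follow a `title: ...\nabstract: ...` layout (some passages in the examples are mid-document chunks and deviate from it). The helpers below are a small sketch of building inputs in that shape; `format_query` and `format_passage` are hypothetical names, the query text is invented for illustration, and the abstract is a placeholder.

```python
def format_query(text):
    """Prefix a claim or post with the 'query: ' marker used in the examples."""
    return f"query: {text.strip()}"

def format_passage(title, abstract):
    """Build a passage string in the 'title / abstract' layout used in the examples."""
    return f"passage: title: {title.strip()}\nabstract: {abstract.strip()}"

inputs = [
    # Illustrative query, not taken from the dataset.
    format_query("  Surgical masks pick up bacteria during long operations  "),
    format_passage(
        "Surgical masks as source of bacterial contamination during operative procedures",
        "(abstract text goes here)",
    ),
]
for text in inputs:
    print(repr(text))

# The resulting list can be passed to model.encode(inputs) as in the snippet above.
```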
Dataset: 10-percent-dev-split, evaluated with InformationRetrievalEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5621 |
| cosine_accuracy@3 | 0.7699 |
| cosine_accuracy@5 | 0.827 |
| cosine_accuracy@10 | 0.8784 |
| cosine_precision@1 | 0.5621 |
| cosine_precision@3 | 0.2566 |
| cosine_precision@5 | 0.1654 |
| cosine_precision@10 | 0.0878 |
| cosine_recall@1 | 0.5621 |
| cosine_recall@3 | 0.7699 |
| cosine_recall@5 | 0.827 |
| cosine_recall@10 | 0.8784 |
| cosine_ndcg@10 | 0.7248 |
| cosine_mrr@10 | 0.675 |
| cosine_map@100 | 0.6791 |
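In the table, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example, 0.7699 / 3 ≈ 0.2566). That pattern is what one would expect if each query has exactly one relevant passage; this is an inference from the numbers, not something the card states. The sketch below computes the per-query metrics from standard textbook definitions on toy rankings. It is not the code of InformationRetrievalEvaluator, and `metrics_at_k` is a hypothetical helper.

```python
import math

def metrics_at_k(ranked_relevance, k, num_relevant):
    """Compute metrics for one query from 0/1 relevance flags in ranked order."""
    top = ranked_relevance[:k]
    hits = sum(top)
    # Reciprocal rank of the first relevant result within the top k.
    reciprocal_rank = 0.0
    for rank, rel in enumerate(top, start=1):
        if rel:
            reciprocal_rank = 1.0 / rank
            break
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(num_relevant, k) + 1))
    return {
        "accuracy": 1.0 if hits > 0 else 0.0,
        "precision": hits / k,
        "recall": hits / num_relevant,
        "mrr": reciprocal_rank,
        "ndcg": dcg / ideal if ideal > 0 else 0.0,
    }

# Two toy queries, each with a single relevant passage.
q1 = [0, 1, 0, 0, 0]  # relevant passage retrieved at rank 2
q2 = [0, 0, 0, 0, 0]  # relevant passage missed entirely
per_query = [metrics_at_k(q, k=3, num_relevant=1) for q in (q1, q2)]
averaged = {name: sum(m[name] for m in per_query) / len(per_query) for name in per_query[0]}
print(averaged)
```

With one relevant passage per query, the averaged `recall` matches `accuracy` and `precision` is `accuracy / k`, mirroring the relationship visible in the reported values.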

Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |

Samples:

| sentence_0 | sentence_1 |
|---|---|
| query: Financial ties between heads of powerful US professional health societies and the corporate sector: cross‑sectional study \| The BMJ | passage: title: Financial ties between leaders of influential US professional medical associations and industry: cross sectional study<br>abstract: To investigate the nature and extent of financial relationships between leaders of influential professional medical associations in the United States and pharmaceutical and device companies.Cross sectional study.Professional associations for the 10 costliest disease areas in the US according to the US Agency for Healthcare Research and Quality. Financial data for association leadership, 2017-19, were obtained from the Open Payments database.328 leaders, such as board members, of 10 professional medical associations: American College of Cardiology, Orthopaedic Trauma Association, American Psychiatric Association, Endocrine Society, American College of Rheumatology, American Society of Clinical Oncology, American Thoracic Society, North American Spine Society, Infectious Diseases Society of America, and American College of Physicians.Proportion ... |
| query: Récente recherche du Mexique sur l'ivermectine associée aux #azithromycine, #montelukast & aspirine sur 768 patients. La mortalité était diminuée de 81%, et 74% de baisse des hospitalisations. Un rétablissement 3,4 fois plus rapide pour les #Covid19. En prépublication | passage: title: Effectiveness of a multidrug therapy consisting of Ivermectin, Azithromycin, Montelukast, and Acetylsalicylic acid to prevent hospitalization and death among ambulatory COVID-19 cases in Tlaxcala, Mexico |
| query: 🧠💥 Suite à un #AVC, on peut constater une réponse inflammatoire persistante qui favorise alors une altération cognitive. La mise d'une molécule à des #souris 🐁 a autorisé de restaurer le métabolisme lipidique dans le #cerveau et diminuer ce risque | passage: in the striatum and thalamus and c-Fos immunoreactivity in hippocampal regions. |

Loss: MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
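MultipleNegativesRankingLoss uses the other passages in a batch as negatives: for each query, the scaled similarities to every passage in the batch act as logits, and cross-entropy pushes the paired passage to the top. The sketch below is a pure-Python illustration of that idea with `scale=20.0` and cosine similarity, matching the parameters above. It covers only the `query_to_doc` direction, ignores the remaining options, and is not the library's implementation; `mnr_loss` is a hypothetical name.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cos_sim(query, passage) for passage in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two (query, passage) pairs, NOT real embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # each passage points the same way as its query
swapped = [aligned[1], aligned[0]]   # pairs deliberately mismatched

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The aligned batch gives a loss near zero while the swapped batch gives a large one, which is the signal that drives matched pairs together during training. Larger batches supply more in-batch negatives per query.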
Non-default hyperparameters:
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 10
- fp16: True
- multi_dataset_batch_sampler: round_robin

All hyperparameters:
- overwrite_output_dir: False
- do_predict: False
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss | 10-percent-dev-split_cosine_ndcg@10 |
|---|---|---|---|
| 0.9225 | 500 | 0.6227 | - |
| 1.0 | 542 | - | 0.7037 |
| 1.8450 | 1000 | 0.2537 | - |
| 2.0 | 1084 | - | 0.7074 |
| 2.7675 | 1500 | 0.1664 | - |
| 3.0 | 1626 | - | 0.7144 |
| 3.6900 | 2000 | 0.1109 | - |
| 4.0 | 2168 | - | 0.7159 |
| 4.6125 | 2500 | 0.0779 | - |
| 5.0 | 2710 | - | 0.7230 |
| 5.5351 | 3000 | 0.0664 | - |
| 6.0 | 3252 | - | 0.7161 |
| 6.4576 | 3500 | 0.0563 | - |
| 7.0 | 3794 | - | 0.7142 |
| 7.3801 | 4000 | 0.0478 | - |
| 8.0 | 4336 | - | 0.7198 |
| 8.3026 | 4500 | 0.0379 | - |
| 9.0 | 4878 | - | 0.7243 |
| 9.2251 | 5000 | 0.0379 | - |
| 10.0 | 5420 | - | 0.7248 |
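
The Epoch column in the log is the step count divided by the steps per epoch (542, since epoch 1.0 is reached at step 542), and training ends at step 5420. With `lr_scheduler_type: linear` and `warmup_steps: 0`, the learning rate would fall linearly from 5e-05 toward zero over those steps. The sketch below reproduces that arithmetic. It assumes the usual linear-decay-to-zero schedule of the Hugging Face Trainer; `epoch_at` and `linear_lr` are hypothetical helpers.

```python
STEPS_PER_EPOCH = 542   # from the log: epoch 1.0 is reached at step 542
TOTAL_STEPS = 5420      # 10 epochs x 542 steps
PEAK_LR = 5e-05         # learning_rate
WARMUP_STEPS = 0        # warmup_steps

def epoch_at(step):
    """Fractional epoch for a given optimizer step."""
    return step / STEPS_PER_EPOCH

def linear_lr(step):
    """Linear warmup (none here) followed by linear decay to zero (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)

for step in (500, 2710, 5420):
    print(f"step {step}: epoch {epoch_at(step):.4f}, lr {linear_lr(step):.2e}")
```

Step 500 maps to epoch 0.9225, as in the first log row. The dev-split NDCG@10 moves only from 0.7037 to 0.7248 between epoch 1 and epoch 10 while the training loss keeps shrinking, so most of the retrieval gain arrives early.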
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model: intfloat/multilingual-e5-large-instruct