How to use MinhPhuc0804/me5-512-docling-checkthat-task1-v1.2 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MinhPhuc0804/me5-512-docling-checkthat-task1-v1.2")
sentences = [
"query: Thrilled to spot our study in @BJSM_BMJ on injury incidence & burden in youth football, taking into account the immature skeleton from a big cohort during 4 back-to-back seasons @aspetar @RoaldBahr",
"passage: title: Longitudinal study of six seasons of match injuries in elite female rugby union\nabstract: ObjectiveTo establish match injury rates and patterns in elite female rugby union players in England.We conducted a six-season (2011/2012-2013/2014 and 2017/2018-2019/2020) prospective cohort study of time-loss match injuries in elite-level female players in the English Premiership competition. A 24-hour time-loss definition was used.Five-hundred and thirty-four time-loss injuries were recorded during 13 680 hours of match exposure. Injury incidence was 39 injuries per 1000 hours (95% CIs 36 to 42) with a mean severity of 48 days (95% CIs 42 to 54) and median severity of 20 days (IQR: 7-57). Concussion was the most common specific injury diagnosis (five concussions per 1000 hours, 95% CIs 4 to 6). The tackle event was associated with the greatest burden of injury (615 days absence per 1000 hours 95% CIs 340 to 1112), with 'being tackled' specifically causing the most injuries (28% of all injuries) and concussions (22% of all concussions).This is the first multiple-season study of match injuries in elite women's rugby union players. Match injury incidence was similar to that previously reported within international women's rugby union. Injury prevention strategies centred on the tackle would focus on high-burden injuries, which are associated with substantial player time-loss and financial costs to teams as well as the high-priority area of concussions.",
"passage: title: Single, Dual, and Triple Use of Cigarettes, e-Cigarettes, and Snus among Adolescents in the Nordic Countries\nabstract: New tobacco and nicotine products have emerged on the market in recent years. Most research has concerned only one product at a time, usually e-cigarettes, while little is known about the multiple use of tobacco and nicotine products among adolescents. We examined single, dual, and triple use of cigarettes, e-cigarettes, and snus among Nordic adolescents, using data of 15–16-year-olds (n = 16,125) from the European School Survey Project on Alcohol and other Drugs (ESPAD) collected in 2015 and 2019 from Denmark, Finland, Iceland, Norway, Sweden, and the Faroe Islands. Country-specific lifetime use of any of these products ranged between 40% and 50%, and current use between 17% and 31%. Cigarettes were the most common product in all countries except for Iceland, where e-cigarettes were remarkably more common. The proportion of dual and triple users was unexpectedly high among both experimental (24%–49%) and current users (31–42%). Triple use was less common than dual use. The users’ patterns varied somewhat between the countries, and Iceland differed substantially from the other countries, with a high proportion of single e-cigarette users. More knowledge on the patterns of multiple use of tobacco and nicotine products and on the potential risk and protective factors is needed for targeted intervention and prevention efforts.",
"passage: title: Injury incidence and burden in a youth elite football academy: a four-season prospective study of 551 players aged from under 9 to under 19 years\nabstract: Objective Investigate the incidence and burden of injuries by age group in youth football (soccer) academy players during four consecutive seasons. Methods All injuries that caused time-loss or required medical attention (as per consensus definitions) were prospectively recorded in 551 youth football players from under 9 years to under 19 years. Injury incidence (II) and burden (IB) were calculated as number of injuries per squad season (s-s), as well as for type, location and age groups. Results A total of 2204 injuries were recorded. 40% (n=882) required medical attention and 60% (n=1322) caused time-loss. The total time-loss was 25 034 days. A squad of 25 players sustained an average of 30 time-loss injuries (TLI) per s-s with an IB of 574 days lost per s-s. Compared with the other age groups, U-16 players had the highest TLI incidence per s-s (95% CI lower-upper): II= 59 (52 to 67); IB=992 days; (963 to 1022) and U-18 players had the greatest burden per s-s: II= 42.1 (36.1 to 49.1); IB= 1408 days (1373 to 1444). Across the cohort of players, contusions (II=7.7/s-s), sprains (II=4.9/s-s) and growth-related injuries (II=4.3/s-s) were the most common TLI. Meniscus/cartilage injuries had the greatest injury severity (95% CI lower-upper): II= 0.4 (0.3 to 0.7), IB= 73 days (22 to 181). The burden (95% CI lower-upper) of physeal fractures (II= 0.8; 0.6 to 1.2; IB= 58 days; 33 to 78) was double than non-physeal fractures. Summary At this youth football academy, each squad of 25 players averaged 30 injuries per season which resulted in 574 days lost."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from intfloat/multilingual-e5-large-instruct. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for retrieval.
SentenceTransformer(
(0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'XLMRobertaModel'})
(1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
(2): Normalize({})
)
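
The Pooling and Normalize modules listed above can be illustrated with a small standard-library sketch. It is not the library's implementation: the function names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are assumptions made for illustration, whereas the real model pools 1024-dimensional token embeddings from XLMRobertaModel. Because `include_prompt` is True, prompt tokens count toward the mean; only padded positions are skipped.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [value / max(count, 1) for value in total]


def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy input: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
print(pooled)
print(embedding)
```

Since the final module normalizes every embedding to unit length, cosine similarity and dot product give the same ranking for this model.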
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("MinhPhuc0804/me5-512-docling-checkthat-task1-v1.2")
# Run inference
sentences = [
'query: It’s been obvious for ages that mRNA vaccines constituted a 3+ dose series. A 3‑dose series is very effective. The fourth dose is still better and ought to be made available. Why does Canada still label partially (2 dose) vaccinated as “fully vaccinated”?',
'passage: title: Protection against omicron severe disease 0-7 months after BNT162b2 booster\nabstract: Abstract Following a rise in cases due to the delta variant and evidence of waning immunity after 2 doses of the BNT162b2 vaccine, Israel began administering a third BNT162b2 dose (booster) in July 2021. Recent studies showed that the 3rd dose provides a much lower protection against infection with the omicron variant compared to the delta variant and that this protection wanes quickly. In this study, we used data from Israel to estimate the protection of the 3rd dose against severe disease up to 7 months from receiving the booster dose. The analysis shows that protection conferred by the 3rd dose against omicron did not wane over a 7-month period and that a 4th dose further increased protection, with a severe disease rate approximately 3-fold lower than in the 3-dose cohorts.',
'passage: title: A fourth dose of the mRNA-1273 SARS-CoV-2 vaccine improves serum neutralization against the delta variant in kidney transplant recipients\nabstract: Abstract In immunocompetent subjects, the effectiveness of SARS-CoV-2 vaccines against the delta variant appears three- to five-fold lower than that observed against the alpha variant. Additionally, three doses of SARS-CoV-2 mRNA-based vaccines might be unable to elicit a sufficient immune response against any variant in immunocompromised kidney transplant recipients. This study describes the kinetics of the neutralizing antibody (NAbs) response against the delta strain before and after a fourth dose of a mRNA vaccine in 67 kidney transplant recipients who had experienced a weak antibody response after three doses. While only 16% of patients harbored NAbs against the delta strain prior to the fourth injection – this percentage raised to 66% afterwards. We also found that, after the fourth dose, the NAbs titer increased significantly (p=0.0001) from <7.5 (IQR : <7.5−15.1) to 47.1 (IQR <7.5−284.2). Collectively, our data indicate that a fourth dose of the mRNA-1273 vaccine in kidney transplant recipients with a weak antibody response after three previous doses improves serum neutralization against the delta variant.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8132, 0.2067],
# [0.8132, 1.0000, 0.1794],
# [0.2067, 0.1794, 1.0000]])
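
For retrieval, only the query row of the similarity matrix matters. The sketch below reuses the scores printed above as plain Python lists and ranks the two passages for the query at index 0. The helper name `rank_passages` is an assumption for illustration and is not part of sentence-transformers.

```python
# Similarity scores copied from the example output above.
similarities = [
    [1.0000, 0.8132, 0.2067],
    [0.8132, 1.0000, 0.1794],
    [0.2067, 0.1794, 1.0000],
]


def rank_passages(sim_matrix, query_index, passage_indices):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [(idx, sim_matrix[query_index][idx]) for idx in passage_indices]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Index 0 is the "query: ..." text; indices 1 and 2 are "passage: ..." texts.
ranking = rank_passages(similarities, query_index=0, passage_indices=[1, 2])
print(ranking)  # [(1, 0.8132), (2, 0.2067)]
```

Here the booster-protection abstract (index 1) ranks above the kidney-transplant abstract (index 2) for the query about dose series.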
10-percent-dev-split, evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.4748 |
| cosine_accuracy@3 | 0.6582 |
| cosine_accuracy@5 | 0.7148 |
| cosine_accuracy@10 | 0.7782 |
| cosine_precision@1 | 0.4748 |
| cosine_precision@3 | 0.2194 |
| cosine_precision@5 | 0.143 |
| cosine_precision@10 | 0.0778 |
| cosine_recall@1 | 0.4748 |
| cosine_recall@3 | 0.6582 |
| cosine_recall@5 | 0.7148 |
| cosine_recall@10 | 0.7782 |
| cosine_ndcg@10 | 0.6259 |
| cosine_mrr@10 | 0.5771 |
| cosine_map@100 | 0.5824 |
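
As a rough guide to reading these metrics, the sketch below computes accuracy@k, precision@k, recall@k and the reciprocal rank on three made-up queries. The function names, document ids and rankings are all hypothetical; this is not the code of InformationRetrievalEvaluator.

```python
def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant, k):
    """1 / rank of the first relevant result within the top k, else 0.0."""
    for position, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical rankings, each query with exactly one relevant document.
queries = [
    (["d3", "d1", "d7"], {"d3"}),  # relevant document at rank 1
    (["d2", "d9", "d4"], {"d4"}),  # relevant document at rank 3
    (["d5", "d6", "d8"], {"d1"}),  # relevant document not retrieved
]

acc3 = mean(accuracy_at_k(r, rel, 3) for r, rel in queries)
prec3 = mean(precision_at_k(r, rel, 3) for r, rel in queries)
rec3 = mean(recall_at_k(r, rel, 3) for r, rel in queries)
mrr3 = mean(reciprocal_rank(r, rel, 3) for r, rel in queries)
print(acc3, prec3, rec3, mrr3)
```

With a single relevant document per query, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k. The table above follows the same pattern (for example 0.6582 / 3 ≈ 0.2194), which suggests, though the card does not state it, that each dev query has one relevant passage.
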
Columns: sentence_0, sentence_1, and sentence_2

| | sentence_0 | sentence_1 | sentence_2 |
|---|---|---|---|
| type | string | string | string |

| sentence_0 | sentence_1 | sentence_2 |
|---|---|---|
| query: I was fact-checked when I covered this topic for @user last year. Since the story back then was, apparently, that damp strips of fabric dangling over people's faces for hours on end couldn't possibly spawn anything nasty - because Science™!! | passage: title: Bacterial and fungal isolation from face masks under the COVID-19 pandemic<br>abstract: Abstract The COVID-19 pandemic has led people to wear face masks daily in public. Although the effectiveness of face masks against viral transmission has been extensively studied, there have been few reports on potential hygiene issues due to bacteria and fungi attached to the face masks. We aimed to (1) quantify and identify the bacteria and fungi attaching to the masks, and (2) investigate whether the mask-attached microbes could be associated with the types and usage of the masks and individual lifestyles. We surveyed 109 volunteers on their mask usage and lifestyles, and cultured bacteria and fungi from either the face-side or outer-side of their masks. The bacterial colony numbers were greater on the face-side than the outer-side; the fungal colony numbers were fewer on the face-side than the outer-side. A longer mask usage significantly increased the fungal colony numbers but not ... | passage: is very low. |
| query: @user If just the US government had some National Institution of Health entity which could’ve been showcasing, studying and verifying this type of advantage from data years earlier .. to apply immediately without political spin | passage: title: Chloroquine is a potent inhibitor of SARS coronavirus infection and spread<br>abstract: Abstract Background Severe acute respiratory syndrome (SARS) is caused by a newly discovered coronavirus (SARS-CoV). No effective prophylactic or post-exposure therapy is currently available. Results We report, however, that chloroquine has strong antiviral effects on SARS-CoV infection of primate cells. These inhibitory effects are observed when the cells are treated with the drug either before or after exposure to the virus, suggesting both prophylactic and therapeutic advantage. In addition to the well-known functions of chloroquine such as elevations of endosomal pH, the drug appears to interfere with terminal glycosylation of the cellular receptor, angiotensin-converting enzyme 2. This may negatively influence the virus-receptor binding and abrogate the infection, with further ramifications by the elevation of vesicular pH, resulting in the inhibition of infection and spread of SAR... | passage: title: A National Medical Response to Crisis — The Legacy of World War II<br>abstract: A National Medical Response to Crisis World War II’s massive casualties were mitigated by lives saved as a result of medical care. Many of the advances made would persist long after the war conclud... |
| query: UNDENIABLE EVIDENCE OF MY SPIKE PROTEIN TRIGGERED WIDESPREAD AMYLOIDOSES THEORY. IT. IS. OCCURRING. | passage: title: Amyloidogenesis of SARS-CoV-2 Spike Protein<br>abstract: ABSTRACT SARS-CoV-2 infection is associated with a surprising number of morbidities. Uncanny similarities with amyloid-disease associated blood coagulation and fibrinolytic disturbances together with neurologic and cardiac problems led us to investigate the amyloidogenicity of the SARS-CoV-2 Spike protein (S-protein). Amyloid fibril assays of peptide library mixtures and theoretical predictions identified seven amyloidogenic sequences within the S-protein. All seven peptides in isolation formed aggregates during incubation at 37°C. Three 20-amino acid long synthetic Spike peptides (sequence 191-210, 599-618, 1165-1184) fulfilled three amyloid fibril criteria: nucleation dependent polymerization kinetics by ThT, Congo red positivity and ultrastructural fibrillar morphology. Full-length folded S-protein did not form amyloid fibrils, but amyloid-like fibrils with evident branching were formed during 24 hours of S-protei... | passage: title: Amyloidogenesis of SARS-CoV-2 Spike Protein<br>abstract: SARS-CoV-2 infection is associated with a surprising number of morbidities. Uncanny similarities with amyloid-disease associated blood coagulation and fibrinolytic disturbances together with neurologic and cardiac problems led us to investigate the amyloidogenicity of the SARS-CoV-2 spike protein (S-protein). Amyloid fibril assays of peptide library mixtures and theoretical predictions identified seven amyloidogenic sequences within the S-protein. All seven peptides in isolation formed aggregates during incubation at 37 °C. Three 20-amino acid long synthetic spike peptides (sequence 192–211, 601–620, 1166–1185) fulfilled three amyloid fibril criteria: nucleation dependent polymerization kinetics by ThT, Congo red positivity, and ultrastructural fibrillar morphology. Full-length folded S-protein did not form amyloid fibrils, but amyloid-like fibrils with evident branching were formed during 24 h of S-protein coincubat... |

Loss: main.TripletMNRLCombinedLoss

Non-default hyperparameters:

- per_device_train_batch_size: 48
- per_device_eval_batch_size: 48
- num_train_epochs: 20
- fp16: True
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- prediction_loss_only: True
- per_device_train_batch_size: 48
- per_device_eval_batch_size: 48
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 20
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss | 10-percent-dev-split_cosine_ndcg@10 |
|---|---|---|---|
| 1.0 | 361 | - | 0.6980 |
| 1.3850 | 500 | 1.6273 | - |
| 2.0 | 722 | - | 0.7033 |
| 2.7701 | 1000 | 0.9528 | - |
| 3.0 | 1083 | - | 0.7110 |
| 4.0 | 1444 | - | 0.6994 |
| 4.1551 | 1500 | 0.6268 | - |
| 5.0 | 1805 | - | 0.6933 |
| 5.5402 | 2000 | 0.4279 | - |
| 6.0 | 2166 | - | 0.6883 |
| 6.9252 | 2500 | 0.3117 | - |
| 7.0 | 2527 | - | 0.6620 |
| 8.0 | 2888 | - | 0.6707 |
| 8.3102 | 3000 | 0.2262 | - |
| 9.0 | 3249 | - | 0.6671 |
| 9.6953 | 3500 | 0.1799 | - |
| 10.0 | 3610 | - | 0.6579 |
| 11.0 | 3971 | - | 0.6470 |
| 11.0803 | 4000 | 0.139 | - |
| 12.0 | 4332 | - | 0.6469 |
| 12.4654 | 4500 | 0.1094 | - |
| 13.0 | 4693 | - | 0.6415 |
| 13.8504 | 5000 | 0.0911 | - |
| 14.0 | 5054 | - | 0.6439 |
| 15.0 | 5415 | - | 0.6284 |
| 15.2355 | 5500 | 0.0755 | - |
| 16.0 | 5776 | - | 0.6272 |
| 16.6205 | 6000 | 0.0664 | - |
| 17.0 | 6137 | - | 0.6290 |
| 18.0 | 6498 | - | 0.6253 |
| 18.0055 | 6500 | 0.0573 | - |
| 19.0 | 6859 | - | 0.6275 |
| 19.3906 | 7000 | 0.052 | - |
| 20.0 | 7220 | - | 0.6259 |
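
The Epoch and Step columns are linked by simple arithmetic, sketched below. The log shows epoch 1.0 at step 361, so the fractional epoch at any step is step / 361. The training-set size bound further assumes a single device, which the card does not state; the batch size of 48, gradient_accumulation_steps of 1 and dataloader_drop_last of False are taken from the hyperparameters above. All variable and function names are illustrative.

```python
STEPS_PER_EPOCH = 361  # from the log: epoch 1.0 is reached at step 361
BATCH_SIZE = 48        # per_device_train_batch_size


def epoch_at_step(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return step / STEPS_PER_EPOCH


print(round(epoch_at_step(1000), 4))  # 2.7701, matching the logged row
print(epoch_at_step(7220))            # 20.0, the final step

# Assumption: one device, so one step consumes one batch of 48 samples.
# With drop_last=False the last batch may be partial, which gives a range.
min_samples = (STEPS_PER_EPOCH - 1) * BATCH_SIZE + 1
max_samples = STEPS_PER_EPOCH * BATCH_SIZE
print(min_samples, max_samples)  # 17281 17328
```
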
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
Base model: intfloat/multilingual-e5-large-instruct