How to use MinhPhuc0804/e5-docling-checkthat-task1-v1 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("MinhPhuc0804/e5-docling-checkthat-task1-v1")
sentences = [
"query: Favorable trends in ecological integrity across most European #rewilding areas yet #CAP subsidies and land‑use policies impede additional progress in certain instances. Take a look at our @user partnership with @user",
"passage: reduced against b. 1. 1. 7 variant. this reduction was also evident in sera from some convalescent patients.\n\ntitle: SARS-CoV-2 B.1.1.7 sensitivity to mRNA vaccine-elicited, convalescent and monoclonal antibodies\nDecreased B.1.1.7 neutralisation was also observed with monoclonal antibodies targeting the N-terminal domain (9 out of 10), the Receptor Binding Motif (RBM) (5 out of 31), but not in neutralising mAbs binding outside the RBM. Introduction of the E484K mutation in a B.1.1.7 background to reflect newly emerging viruses in the UK led to a more substantial loss of neutralising activity by vaccine-elicited antibodies and mAbs (19 out of 31) over that conferred by the B.1.1.7 mutations alone. E484K emergence on a B.1.1.7 background represents a threat to the vaccine BNT162b.",
"passage: ##es were also at a greater risk for covid - 19 - related - hospitalizations compared to those that were previously infected.\n\ntitle: Comparing SARS-CoV-2 natural immunity to vaccine-induced immunity: reinfections versus breakthrough infections\nConclusions This study demonstrated that natural immunity confers longer lasting and stronger protection against infection, symptomatic disease and hospitalization caused by the Delta variant of SARS-CoV-2, compared to the BNT162b2 two-dose vaccine-induced immunity. Individuals who were both previously infected with SARS-CoV-2 and given a single dose of the vaccine gained additional protection against the Delta variant.",
"passage: that rewilding scores have improved in five sites, but declined in two, partly due to competing socio ‐ economic trends.\n\ntitle: Expert‐based assessment of rewilding indicates progress at site‐level, yet challenges for upscaling\nMajor threats for rewilding progress are related to land‐use intensification policies and persecution of keystone species. Major determinants of rewilding success are its societal appeal and socio‐economic benefits to local people. We provide an assessment of rewilding that is crucial in improving its restoration outcomes and informed implementation at scale across Europe in this decade of ecosystem restoration."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from intfloat/e5-large-v2. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for retrieval.
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
  (2): Normalize({})
)
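The Pooling and Normalize stages listed above can be illustrated with a small sketch. It is written in plain Python on toy 2-dimensional vectors rather than the model's 1024-dimensional ones, and the function names `mean_pool` and `l2_normalize` are hypothetical helpers, not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)
```

Because the output has unit length, the cosine similarity between two such embeddings reduces to their dot product.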
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("MinhPhuc0804/clef-docling-e5-large-v2")
# Run inference
sentences = [
'query: Link between vitamin D intake and COVID-19 contagion and death rate | Scientific Reports #cndpoli',
'passage: title: Association between vitamin D supplementation and COVID-19 infection and mortality\nabstract: Abstract Vitamin D deficiency has long been associated with reduced immune function that can lead to viral infection. Several studies have shown that Vitamin D deficiency is associated with increases the risk of infection with COVID-19. However, it is unknown if treatment with Vitamin D can reduce the associated risk of COVID-19 infection, which is the focus of this study. In the population of US veterans, we show that Vitamin D 2 and D 3 fills were associated with reductions in COVID-19 infection of 28% and 20%, respectively [(D 3 Hazard Ratio (HR) = 0.80, [95% CI 0.77, 0.83]), D 2 HR = 0.72, [95% CI 0.65, 0.79]]. Mortality within 30-days of COVID-19 infection was similarly 33% lower with Vitamin D 3 and 25% lower with D 2 (D 3 HR = 0.67, [95% CI 0.59, 0.75]; D 2 HR = 0.75, [95% CI 0.55, 1.04]).',
'passage: title: Harm to Nonhuman Animals from AI: a Systematic Account and Framework\nabstract: Abstract This paper provides a systematic account of how artificial intelligence (AI) technologies could harm nonhuman animals and explains why animal harms, often neglected in AI ethics, should be better recognised. After giving reasons for caring about animals and outlining the nature of animal harm, interests, and wellbeing, the paper develops a comprehensive ‘harms framework’ which draws on scientist David Fraser’s influential mapping of human activities that impact on sentient animals. The harms framework is fleshed out with examples inspired by both scholarly literature and media reports. This systematic account and framework should help inform ethical analyses of AI’s impact on animals and serve as a comprehensive and clear basis for the development and regulation of AI technologies to prevent and mitigate harm to nonhumans.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.6243, -0.0993],
# [ 0.6243, 1.0000, 0.0158],
# [-0.0993, 0.0158, 1.0000]])
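For retrieval, the row of the similarity matrix that belongs to the query is what matters: its passage columns are sorted in descending order. The sketch below does this with the scores printed above; `with_prefix` and `rank_passages` are hypothetical helpers, and the short passage labels are placeholders for the full texts.

```python
def with_prefix(text, kind):
    """Prepend the 'query: ' or 'passage: ' prefix used in the examples above."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"

def rank_passages(query_scores, passages):
    """Return (passage, score) pairs sorted by descending similarity."""
    order = sorted(range(len(passages)), key=lambda i: query_scores[i], reverse=True)
    return [(passages[i], query_scores[i]) for i in order]

# Scores copied from the first row of the printed similarity matrix
# (columns 1 and 2 are the two passages). The labels are placeholders.
scores = [0.6243, -0.0993]
labels = ["vitamin D abstract", "AI animal-harm abstract"]
ranking = rank_passages(scores, labels)
print(ranking[0])
```

In practice the query would be encoded once and compared against a precomputed matrix of passage embeddings, not re-encoded together with every passage.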
Information retrieval metrics on the 10-percent-dev-split dataset, evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5413 |
| cosine_accuracy@3 | 0.7439 |
| cosine_accuracy@5 | 0.813 |
| cosine_accuracy@10 | 0.8722 |
| cosine_precision@1 | 0.5413 |
| cosine_precision@3 | 0.248 |
| cosine_precision@5 | 0.1626 |
| cosine_precision@10 | 0.0872 |
| cosine_recall@1 | 0.5413 |
| cosine_recall@3 | 0.7439 |
| cosine_recall@5 | 0.813 |
| cosine_recall@10 | 0.8722 |
| cosine_ndcg@10 | 0.7072 |
| cosine_mrr@10 | 0.6541 |
| cosine_map@100 | 0.658 |
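The values in the table are consistent with every query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k. Under that assumption (it is inferred from the numbers, not stated in the card), the metrics can be sketched from the rank of the relevant passage for each query. `retrieval_metrics` is a hypothetical helper and the ranks are toy data.

```python
def retrieval_metrics(ranks, k):
    """ranks[i] is the 1-based rank of the single relevant passage for
    query i, or None if it was not retrieved at all."""
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    accuracy = hits / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one of the k results is relevant
        "recall": accuracy,         # the one relevant passage is found or not
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

toy_ranks = [1, 3, 2, None, 1]
metrics_at_3 = retrieval_metrics(toy_ranks, 3)
print(metrics_at_3)

# The same relation applied to the reported accuracies:
print(round(0.7439 / 3, 4), round(0.813 / 5, 4), round(0.8722 / 10, 4))
```

The last line reproduces the reported precision values at k = 3, 5, and 10 after rounding to the table's precision.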
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

Samples:

| sentence_0 | sentence_1 |
|---|---|
| query: Everyday chatter creates a coronavirus cloud that persists for 'tens of minutes or longer'. Teachers talk. Teachers move. Kids talk. Kids move. A blanket reopening of schools now makes no sense. | passage: title: The airborne lifetime of small speech droplets and their potential importance in SARS-CoV-2 transmission |
| query: Les avalanches de cytokines provoquées par les macrophages durant le COVID-19 grave débutent par une infection du SRAS‑CoV‑2 de cellules dendritiques particulières qui génèrent ensuite d’importantes quantités d’interféron pro‑inflammatoire de type I. 🤔 | passage: changes in macrophages at both transcriptional and epigenetic levels, which favored their hyperactivation by environmental stimuli. |
| query: Nye tal fra UK vedr. myocarditis. The risk of myocarditis was greater after vaccination than illness in young men. | passage: ##2 ( irr 2. 02, 95 % ci 1. 40, 2. 91 ). |
MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
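As a rough illustration of what these parameters mean, the sketch below computes an in-batch-negatives ranking loss in the query-to-document direction: cosine similarities are multiplied by `scale`, and each query is trained to pick its own passage out of all passages in the batch with a cross-entropy objective. It is a simplified stand-in written in plain Python, not the library's implementation, and it ignores the `gather_across_devices`, `partition_mode`, and hardness settings.

```python
import math

def multiple_negatives_ranking_loss(sim_matrix, scale=20.0):
    """sim_matrix[i][j] is the cosine similarity between query i and
    passage j. Passage i is the positive for query i; every other passage
    in the batch acts as a negative."""
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])  # cross-entropy with target i
    return sum(losses) / len(losses)

good = [[0.9, 0.1], [0.0, 0.8]]  # positives on the diagonal score highest
bad = [[0.1, 0.9], [0.8, 0.0]]   # negatives outscore the positives
print(multiple_negatives_ranking_loss(good), multiple_negatives_ranking_loss(bad))
```

With a scale of 20, a similarity gap of 0.8 becomes a logit gap of 16, so the loss is close to zero for `good` and close to 16 for `bad`.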
Non-default hyperparameters:

- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 10
- fp16: True
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss | 10-percent-dev-split_cosine_ndcg@10 |
|---|---|---|---|
| 0.9225 | 500 | 0.7284 | - |
| 1.0 | 542 | - | 0.6813 |
| 1.8450 | 1000 | 0.2798 | - |
| 2.0 | 1084 | - | 0.7030 |
| 2.7675 | 1500 | 0.1728 | - |
| 3.0 | 1626 | - | 0.6994 |
| 3.6900 | 2000 | 0.1095 | - |
| 4.0 | 2168 | - | 0.7063 |
| 4.6125 | 2500 | 0.0804 | - |
| 5.0 | 2710 | - | 0.7036 |
| 5.5351 | 3000 | 0.063 | - |
| 6.0 | 3252 | - | 0.7052 |
| 6.4576 | 3500 | 0.0499 | - |
| 7.0 | 3794 | - | 0.7064 |
| 7.3801 | 4000 | 0.0439 | - |
| 8.0 | 4336 | - | 0.7052 |
| 8.3026 | 4500 | 0.0392 | - |
| 9.0 | 4878 | - | 0.7069 |
| 9.2251 | 5000 | 0.0398 | - |
| 10.0 | 5420 | - | 0.7072 |
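The Epoch column follows directly from the step count: the log reaches epoch 1.0 at step 542, so a step maps to `step / 542` epochs. The sketch below reproduces that arithmetic and also estimates the learning rate at a given step, assuming the linear scheduler decays from the configured `learning_rate` of 5e-05 to zero at the final step with no warmup (as `lr_scheduler_type: linear` and `warmup_steps: 0` suggest). The constant and function names are hypothetical.

```python
STEPS_PER_EPOCH = 542  # from the log: step 542 is reported as epoch 1.0
TOTAL_STEPS = 5420     # 10 epochs of 542 steps each
BASE_LR = 5e-05        # learning_rate from the hyperparameters

def epoch_at(step):
    """Fractional epoch reached after the given optimizer step."""
    return step / STEPS_PER_EPOCH

def linear_lr(step):
    """Learning rate under linear decay to zero with no warmup (assumed)."""
    return BASE_LR * max(0.0, (TOTAL_STEPS - step) / TOTAL_STEPS)

for step in (500, 3000, 5420):
    print(step, round(epoch_at(step), 4), linear_lr(step))
```

Under this assumption the learning rate is already at half its initial value by the end of epoch 5, which is also where the dev-set NDCG@10 in the table has largely flattened out.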
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model
intfloat/e5-large-v2