SentenceTransformer based on intfloat/e5-large-v2

This is a sentence-transformers model fine-tuned from intfloat/e5-large-v2. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/e5-large-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modality: Text

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'mean', 'include_prompt': True})
  (2): Normalize({})
)
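
The Pooling and Normalize modules listed above can be illustrated without the library. What follows is a minimal sketch, assuming the token embeddings and attention mask are already available as plain Python lists; the function names `mean_pool` and `l2_normalize` are illustrative and not part of Sentence Transformers. Because `include_prompt` is True, prefix tokens such as "query: " are averaged along with the rest of the input, so the sketch applies no special handling to them.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]

# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Since the output is unit length, the cosine similarity of two such embeddings reduces to their dot product.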

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MinhPhuc0804/clef-docling-e5-large-v2")
# Run inference
sentences = [
    'query: Link between vitamin D intake and COVID-19 contagion and death rate | Scientific Reports #cndpoli',
    'passage: title: Association between vitamin D supplementation and COVID-19 infection and mortality\nabstract: Abstract Vitamin D deficiency has long been associated with reduced immune function that can lead to viral infection. Several studies have shown that Vitamin D deficiency is associated with increases the risk of infection with COVID-19. However, it is unknown if treatment with Vitamin D can reduce the associated risk of COVID-19 infection, which is the focus of this study. In the population of US veterans, we show that Vitamin D 2 and D 3 fills were associated with reductions in COVID-19 infection of 28% and 20%, respectively [(D 3 Hazard Ratio (HR) = 0.80, [95% CI 0.77, 0.83]), D 2 HR = 0.72, [95% CI 0.65, 0.79]]. Mortality within 30-days of COVID-19 infection was similarly 33% lower with Vitamin D 3 and 25% lower with D 2 (D 3 HR = 0.67, [95% CI 0.59, 0.75]; D 2 HR = 0.75, [95% CI 0.55, 1.04]).',
    'passage: title: Harm to Nonhuman Animals from AI: a Systematic Account and Framework\nabstract: Abstract This paper provides a systematic account of how artificial intelligence (AI) technologies could harm nonhuman animals and explains why animal harms, often neglected in AI ethics, should be better recognised. After giving reasons for caring about animals and outlining the nature of animal harm, interests, and wellbeing, the paper develops a comprehensive ‘harms framework’ which draws on scientist David Fraser’s influential mapping of human activities that impact on sentient animals. The harms framework is fleshed out with examples inspired by both scholarly literature and media reports. This systematic account and framework should help inform ethical analyses of AI’s impact on animals and serve as a comprehensive and clear basis for the development and regulation of AI technologies to prevent and mitigate harm to nonhumans.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.6243, -0.0993],
#         [ 0.6243,  1.0000,  0.0158],
#         [-0.0993,  0.0158,  1.0000]])
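
For retrieval, the first row of the similarity matrix above is what matters: it scores each passage against the query. The sketch below shows that ranking step on toy vectors in plain Python. The helper names (`format_query`, `format_passage`, `rank_passages`) are hypothetical; the "query: " and "passage: " prefixes follow the convention used in the example above and in the training samples.

```python
import math

def format_query(text):
    """Add the query prefix used throughout this card."""
    return "query: " + text

def format_passage(text):
    """Add the passage prefix used throughout this card."""
    return "passage: " + text

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs, best match first."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy 3-dimensional vectors standing in for model.encode(...) output.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.8, 0.6, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_passages(query, passages)
```

In practice the vectors would come from `model.encode` applied to the prefixed strings, and `model.similarity` would replace the hand-written `cosine`.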

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.5413
cosine_accuracy@3    0.7439
cosine_accuracy@5    0.813
cosine_accuracy@10   0.8722
cosine_precision@1   0.5413
cosine_precision@3   0.248
cosine_precision@5   0.1626
cosine_precision@10  0.0872
cosine_recall@1      0.5413
cosine_recall@3      0.7439
cosine_recall@5      0.813
cosine_recall@10     0.8722
cosine_ndcg@10       0.7072
cosine_mrr@10        0.6541
cosine_map@100       0.658
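
In the table, recall@k equals accuracy@k and precision@k is approximately accuracy@k divided by k, which is what one would expect if each query has a single relevant passage. The sketch below computes these metrics for one query using the standard binary-relevance definitions. It is an illustration, not the evaluator's own code, and the function name `metrics_at_k` and the document ids are made up.

```python
import math

def metrics_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance retrieval metrics for a single query."""
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in ranked_ids[:k]]
    num_hits = sum(hits)

    # Reciprocal rank of the first relevant document within the top k.
    mrr = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            mrr = 1.0 / rank
            break

    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,
        "precision": num_hits / k,
        "recall": num_hits / len(relevant_ids),
        "mrr": mrr,
        "ndcg": dcg / idcg if idcg > 0 else 0.0,
    }

# One query whose only relevant passage is retrieved at rank 3.
result = metrics_at_k(["d7", "d2", "d9", "d4"], {"d9"}, k=3)
```

The reported values are these per-query numbers averaged over all evaluation queries.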

Training Details

Training Dataset

Unnamed Dataset

  • Size: 17,319 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (type: string): min 21 tokens, mean 57.37 tokens, max 129 tokens
    • sentence_1 (type: string): min 26 tokens, mean 186.91 tokens, max 256 tokens
  • Samples:
    • sentence_0: query: Everyday chatter creates a coronavirus cloud that persists for 'tens of minutes or longer'. Teachers talk. Teachers move. Kids talk. Kids move. A blanket reopening of schools now makes no sense.
      sentence_1: passage: title: The airborne lifetime of small speech droplets and their potential importance in SARS-CoV-2 transmission
      abstract: Speech droplets generated by asymptomatic carriers of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are increasingly considered to be a likely mode of disease transmission. Highly sensitive laser light scattering observations have revealed that loud speech can emit thousands of oral fluid droplets per second. In a closed, stagnant air environment, they disappear from the window of view with time constants in the range of 8 to 14 min, which corresponds to droplet nuclei of ca. 4 μm diameter, or 12- to 21-μm droplets prior to dehydration. These observations confirm that there is a substantial probability that normal speaking causes airborne virus transmission in confined environments.
    • sentence_0: query: Les avalanches de cytokines provoquées par les macrophages durant le COVID-19 grave débutent par une infection du SRAS‑CoV‑2 de cellules dendritiques particulières qui génèrent ensuite d’importantes quantités d’interféron pro‑inflammatoire de type I. 🤔
      sentence_1: passage: changes in macrophages at both transcriptional and epigenetic levels, which favored their hyperactivation by environmental stimuli.

      title: Sensing of SARS-CoV-2 by pDCs and their subsequent production of IFN-I contribute to macrophage-induced cytokine storm during COVID-19
      Together, these data indicate that the priming of macrophages can result from the response by pDCs to SARS-CoV-2, leading to macrophage activation in patients with severe COVID-19.
    • sentence_0: query: Nye tal fra UK vedr. myocarditis. The risk of myocarditis was greater after vaccination than illness in young men.
      sentence_1: passage: ##2 ( irr 2. 02, 95 % ci 1. 40, 2. 91 ).

      title: Risk of myocarditis following sequential COVID-19 vaccinations by age and sex
      Associations were strongest in males younger than 40 years for all vaccine types with an additional 3 (95%CI 1, 5) and 12 (95% CI 1,17) events per million estimated in the 1-28 days following a first dose of BNT162b2 and mRNA-1273, respectively; 14 (95%CI 8, 17), 12 (95%CI 1, 7) and 101 (95%CI 95, 104) additional events following a second dose of ChAdOx1, BNT162b2 and mRNA-1273, respectively; and 13 (95%CI 7, 15) additional events following a third dose of BNT162b2, compared with 7 (95%CI 2, 11) additional events following COVID-19 infection. An association between COVID-19 infection and myocarditis was observed in all ages for both sexes but was substantially higher in those older than 40 years. These findings have important implications for public health and vaccination policy. Funding Health Data Research UK.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
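
With these parameters the loss treats every other passage in the batch as a negative for a given query: the cosine similarities are multiplied by the scale (20.0) and passed through a softmax cross-entropy whose target is the query's own passage. The sketch below works through that computation in plain Python for the "query_to_doc" direction only. It is an illustration under those assumptions, not the library implementation, and the names `cos_sim` and `mnr_loss` are chosen for this example.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cos_sim(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over all in-batch passages.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other passage

loss_aligned = mnr_loss(queries, aligned_docs)
loss_swapped = mnr_loss(queries, swapped_docs)
```

The aligned batch yields a loss close to zero, while the swapped batch yields a loss close to the scale, which shows how the scale sets the penalty for ranking a negative above the positive.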
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 10
  • fp16: True
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}
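
The combination of `lr_scheduler_type: linear`, `warmup_steps: 0` and `learning_rate: 5e-05` implies a learning rate that decays linearly from 5e-05 to zero over the run. The sketch below shows that schedule as a plain function. It assumes 5420 total optimizer steps, the final step in the training logs, and the name `linear_lr` is illustrative.

```python
def linear_lr(step, base_lr=5e-5, warmup_steps=0, total_steps=5420):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

lr_start = linear_lr(0)        # full learning rate at the first step
lr_middle = linear_lr(2710)    # half of it at the end of epoch 5
lr_end = linear_lr(5420)       # zero at the final step
```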

Training Logs

Epoch   Step  Training Loss  10-percent-dev-split_cosine_ndcg@10
0.9225  500   0.7284         -
1.0     542   -              0.6813
1.8450  1000  0.2798         -
2.0     1084  -              0.7030
2.7675  1500  0.1728         -
3.0     1626  -              0.6994
3.6900  2000  0.1095         -
4.0     2168  -              0.7063
4.6125  2500  0.0804         -
5.0     2710  -              0.7036
5.5351  3000  0.063          -
6.0     3252  -              0.7052
6.4576  3500  0.0499         -
7.0     3794  -              0.7064
7.3801  4000  0.0439         -
8.0     4336  -              0.7052
8.3026  4500  0.0392         -
9.0     4878  -              0.7069
9.2251  5000  0.0398         -
10.0    5420  -              0.7072
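
The step counts in the log follow from the dataset size and batch size reported above. The short calculation below reproduces them, assuming a single device (an assumption; the card does not state the device count) together with the listed `gradient_accumulation_steps: 1` and `dataloader_drop_last: False`.

```python
import math

train_samples = 17_319   # training set size
batch_size = 32          # per_device_train_batch_size
epochs = 10              # num_train_epochs

# The last, partial batch is kept because dataloader_drop_last is False.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch
```

This gives 542 steps per epoch and 5420 steps in total, matching the evaluation rows, and `epoch_at(500)` rounds to the 0.9225 shown in the first log line.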

Training Time

  • Training: 1.1 hours

Framework Versions

  • Python: 3.12.6
  • Sentence Transformers: 5.4.1
  • Transformers: 4.56.0
  • PyTorch: 2.8.0+cu129
  • Accelerate: 1.10.1
  • Datasets: 4.8.4
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}

Model size: 0.3B params (Safetensors; tensor type: F32)