tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:478146
  - loss:CoSENTLoss
widget:
  - source_sentence: >-
      However, its underutilization is mainly due to the absence of a concrete
      and coherent dissemination strategy.
    sentences:
      - >-
        At the same time, they need to understand that living in Europe brings
        great responsibilities in addition to great benefits.
      - >-
        The mainstay of any intelligent and patriotic mineral policy can be
        summed up in the following postulate: "since minerals are exhaustible,
        they should only be exploited with the maximum return for the economy of
        the country where they are mined".
      - >-
        We must move quickly to a shared sustainable energy supply, sustainable
        transportation and clean air.
  - source_sentence: >-
      Their track record shows they do not support Australia's traditional
      industries because they are constantly pandering to the Greens.
    sentences:
      - >-
        An economic dynamic based on the sustainable development of national
        potential, equitable access to the means of production, social justice,
        environmental conservation, the incorporation of added value, the
        promotion of competitiveness and self-management,
      - >-
        the cry "El campo no aguanta más" (The countryside can't take it
        anymore), of the peasant movement and its proclamation of "Salvemos al
        Campo para salvar a México" (Let's save the countryside to save Mexico);
      - >-
        On the other hand, increasing defence capacity is directly related to
        the involvement of all citizens in appropriate programmes, which,
        together with the acquisition of skills, experience and organisation,
        also contribute to forging a spirit of militancy and collectivity.
  - source_sentence: >-
      We will prepare the proposals of the United Nations Declaration on the
      Rights of the Child in line with the commitments made.
    sentences:
      - >-
        For the presentation of Czech culture, we will also use the upcoming
        major anniversaries (100 years of the founding of Czechoslovakia, the
        30th anniversary of the canonization of Agnes of Bohemia, 600 years
        since the birth of George of Poděbrady, etc.).
      - >-
        Separate prison units for young people should be established, and
        special rehabilitation measures should be introduced in these units.
      - >-
        Austrian citizenship is a valuable asset and should not become
        accessible to those who do not abide by the laws of our state.
  - source_sentence: >-
      Third, CD&V wants to strengthen the social sustainability of our
      agriculture and horticulture sector.
    sentences:
      - >-
        We will take a farm-level approach where possible so that low-emissions
        farmers are rewarded with a lower cost through the ETS, rather than the
        current approach that assumes each cow, for instance, has the same
        emissions on every farm.
      - >-
        In addition, 20 billion euros in tax revenues are fraudulently evaded
        every year (the equivalent of the healthcare budget).
      - >-
        87 percent of arrested undocumented migrants are released sooner or
        later, but without papers, in a lawless situation.
  - source_sentence: >-
      This incites social hatred, threatens economic and social stability, and
      undermines trust in the authorities.
    sentences:
      - "\_The conditions for a healthy entrepreneurship, where the most innovative and creative win and where the source of enrichment cannot be property speculation or guilds and networks.   "
      - >-
        According to statistics from the Attorney General's Office, since
        February 2005, when the implementation of the PSD was announced, the
        rate of violent deaths per 100,000 inhabitants has dropped from 26.41 in
        December 2005 to 18.43 in December 2007.
      - >-
        As a result, the profits of the oligarchs are more than 400 times what
        our entire country gets from the exploitation of natural resources.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer

This is a sentence-transformers model trained on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
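
The Pooling module above has pooling_mode_mean_tokens set to True, so the sentence vector is the average of the token vectors. The following is a minimal NumPy sketch of that idea, assuming padding positions are excluded through the attention mask. The function name mean_pool and the toy shapes are illustrative only and are not part of the library.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding sequence.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 4 dimensions (the model itself uses 768).
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last position is padding
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)  # (2, 4)
```

The first row averages only its two unmasked tokens, which is why padding does not dilute the sentence vector.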

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("LequeuISIR/final-DPR-8e-05")
# Run inference
sentences = [
    'This incites social hatred, threatens economic and social stability, and undermines trust in the authorities.',
    '\xa0The conditions for a healthy entrepreneurship, where the most innovative and creative win and where the source of enrichment cannot be property speculation or guilds and networks.   ',
    'As a result, the profits of the oligarchs are more than 400 times what our entire country gets from the exploitation of natural resources.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
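
Since the model's similarity function is cosine similarity, the 3 x 3 matrix from the usage snippet holds the cosine of the angle between every pair of embeddings. A rough NumPy equivalent is sketched below. It uses random vectors in place of real model output, and cosine_similarity_matrix is a hypothetical helper name, not a library function.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in for model.encode(sentences): 3 random 768-dimensional vectors.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 768))

sims = cosine_similarity_matrix(fake_embeddings, fake_embeddings)
print(sims.shape)  # (3, 3)

# Rank the other rows by similarity to row 0, most similar first.
ranking = [int(i) for i in np.argsort(-sims[0]) if i != 0]
print(ranking)
```

Each vector has similarity 1 with itself, so the diagonal is all ones and is usually ignored when ranking.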

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 478,146 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    |         | sentence1 | sentence2 | label |
    |---------|-----------|-----------|-------|
    | type    | string    | string    | int   |
    | details | min: 17 tokens; mean: 33.73 tokens; max: 107 tokens | min: 16 tokens; mean: 33.84 tokens; max: 101 tokens | 0: ~57.50%; 1: ~4.10%; 2: ~38.40% |
  • Samples:
    | sentence1 | sentence2 | label |
    |-----------|-----------|-------|
    | There have also been other important structural changes in the countryside, which have come together to form this new, as yet unknown, country. | Meanwhile, investment, which is the way to increase production, employment capacity and competitiveness of the economy, fell from 20% of output in 1974 to only 11.8% on average between 1984 and 1988. | 0 |
    | Introduce new visa categories so we can be responsive to humanitarian needs and incentivise greater investment in our domestic infrastructure and regional economies | The purpose of the project is to design and implement public policies aimed at achieving greater and faster inclusion of immigrants. | 2 |
    | and economic crimes that seriously and generally affect the fundamental rights of individuals and the international community as a whole. | For the first time in the history, not only of Ecuador, but of the entire world, a government promoted a public audit process of the foreign debt and declared some of its tranches illegitimate and immoral. | 0 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
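
For readers unfamiliar with CoSENT, the loss compares sentence pairs with each other: whenever one pair has a lower label than another, its cosine similarity should also be lower. The pure-Python sketch below follows the published CoSENT formulation as I understand it, with the scale of 20.0 listed above. It is an illustration, not the library's implementation, and cosent_loss is a hypothetical name.

```python
import math

def cosent_loss(cos_sims, labels, scale=20.0):
    """log(1 + sum(exp(scale * (s_low - s_high)))) over all (low, high) pairs of
    examples where the first example has a strictly lower label than the second."""
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for s_low, y_low in zip(cos_sims, labels):
        for s_high, y_high in zip(cos_sims, labels):
            if y_low < y_high:
                terms.append(scale * (s_low - s_high))
    # Numerically stable log-sum-exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Two sentence pairs with labels 2 and 0 (only the ordering of labels matters).
well_ordered = cosent_loss([0.9, 0.1], [2, 0])  # higher label, higher cosine
mis_ordered = cosent_loss([0.1, 0.9], [2, 0])   # higher label, lower cosine
print(well_ordered < mis_ordered)  # True
```

Because only the relative order of labels enters the loss, the integer labels 0, 1 and 2 act as ranks rather than as target similarity values.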
    

Evaluation Dataset

json

  • Dataset: json
  • Size: 478,146 evaluation samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    |         | sentence1 | sentence2 | label |
    |---------|-----------|-----------|-------|
    | type    | string    | string    | int   |
    | details | min: 17 tokens; mean: 33.62 tokens; max: 103 tokens | min: 16 tokens; mean: 34.48 tokens; max: 111 tokens | 0: ~57.30%; 1: ~2.90%; 2: ~39.80% |
  • Samples:
    | sentence1 | sentence2 | label |
    |-----------|-----------|-------|
    | The anchoring of the Slovak Republic in the European Union allows citizens to feel: secure politically, secure economically, secure socially. | Radikale Venstre wants Denmark to participate fully and firmly in EU cooperation on immigration, asylum and cross-border crime. | 2 |
    | Portugal's participation in the Community's negotiation of the next financial perspective should also be geared in the same direction. | Given the dynamic international framework, safeguarding the national interest requires adjustments to each of these vectors. | 2 |
    | On asylum, the Green Party will: Dismantle the direct provision system and replace it with an efficient and humane system for determining the status of asylum seekers | The crisis in the coal sector subsequently forced these immigrant workers to move into other economic sectors such as metallurgy, chemicals, construction and transport. | 2 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 8e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.05
  • bf16: True
  • batch_sampler: no_duplicates
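
To make learning_rate and warmup_ratio concrete: together with the linear scheduler listed under All Hyperparameters, they describe a rate that climbs linearly to 8e-05 over the first 5% of steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python. TOTAL_STEPS is a hypothetical round number chosen for illustration, and linear_warmup_lr is not a library function.

```python
import math

def linear_warmup_lr(step, total_steps, peak_lr=8e-05, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 30_000  # hypothetical; the actual step count of the run is not stated

for step in (0, 750, 1_500, 15_750, 30_000):
    print(step, f"{linear_warmup_lr(step, TOTAL_STEPS):.2e}")
```

With these assumed numbers the peak is reached at step 1,500, and the rate is back at half the peak midway through the decay phase.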

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.0837 500 0.7889 9.5828
0.1673 1000 1.2158 9.3274
0.2510 1500 1.8215 9.4274
0.3346 2000 2.3548 8.2583
0.4183 2500 2.7493 8.1446
0.5019 3000 2.8998 7.9046
0.5856 3500 2.9298 8.0640
0.6692 4000 2.9053 7.2746
0.7529 4500 3.0905 7.5099
0.8365 5000 3.1864 7.3883
0.9202 5500 3.2322 6.9968
1.0038 6000 3.1194 7.4682
1.0875 6500 3.0122 7.7295
1.1712 7000 3.0453 7.1696
1.2548 7500 2.9439 7.2775
1.3385 8000 3.1108 7.4838
1.4221 8500 2.8512 7.5204
1.5058 9000 2.9865 7.4528
1.5894 9500 2.9995 8.0682
1.6731 10000 3.1073 7.5344
1.7567 10500 3.0631 7.4572
1.8404 11000 2.9915 7.4961
1.9240 11500 3.0445 7.3575
2.0077 12000 2.9501 7.9786
2.0914 12500 2.3377 8.6208
2.1750 13000 2.2833 8.8356
2.2587 13500 2.2785 8.8709
2.3423 14000 2.3012 8.6250
2.4260 14500 2.3488 8.1099
2.5096 15000 2.095 9.2305
2.5933 15500 2.4123 8.6405
2.6769 16000 2.2236 8.7805
2.7606 16500 2.3367 8.7110
2.8442 17000 2.1159 8.6447
2.9279 17500 2.1622 8.7123
3.0115 18000 2.1916 9.0314
3.0952 18500 1.604 9.3373
3.1789 19000 1.4116 9.6509
3.2625 19500 1.4036 9.9127
3.3462 20000 1.5392 9.8093
3.4298 20500 1.5791 9.8325
3.5135 21000 1.5343 9.7822
3.5971 21500 1.3913 9.6243
3.6808 22000 1.5151 9.9644
3.7644 22500 1.3922 9.7816
3.8481 23000 1.3361 9.5338
3.9317 23500 1.3363 9.8282
4.0154 24000 1.2234 10.2117
4.0990 24500 0.5927 10.4107
4.1827 25000 0.6879 10.4405
4.2664 25500 0.6832 10.5138
4.3500 26000 0.6514 10.2798
4.4337 26500 0.7396 10.3250
4.5173 27000 0.6813 10.4115
4.6010 27500 0.765 10.1365
4.6846 28000 0.5915 10.2402
4.7683 28500 0.5028 10.3197
4.8519 29000 0.5306 10.3270
4.9356 29500 0.5886 10.3543
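
In the table above, the lowest logged validation loss is 6.9968 at step 5500, and validation loss is higher at the end of training even though training loss keeps falling. The short sketch below shows one way to pick the best step from such a log. It uses a handful of rows copied from the table rather than the full log, and the variable names are illustrative.

```python
# (step, training_loss, validation_loss) rows copied from the log table.
rows = [
    (500, 0.7889, 9.5828),
    (5500, 3.2322, 6.9968),
    (12000, 2.9501, 7.9786),
    (18000, 2.1916, 9.0314),
    (24000, 1.2234, 10.2117),
    (29500, 0.5886, 10.3543),
]

# Select the row with the smallest validation loss.
best_step, best_train, best_val = min(rows, key=lambda r: r[2])
print(f"best step: {best_step}, validation loss: {best_val}")

# Gap between the final and the best validation loss in this subset.
final_val = rows[-1][2]
print(f"final minus best validation loss: {final_val - best_val:.4f}")
```

Comparing checkpoints on validation loss rather than training loss is a common way to decide which one to keep when the two curves diverge like this.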

Framework Versions

  • Python: 3.9.21
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}