SentenceTransformer based on Lajavaness/bilingual-embedding-large

This is a sentence-transformers model finetuned from Lajavaness/bilingual-embedding-large. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Lajavaness/bilingual-embedding-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
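
The Pooling and Normalize modules above amount to a masked mean over token vectors followed by L2 normalization. The following is a minimal pure-Python sketch of that idea on toy data, not the library's implementation; the helper names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in summed]


def l2_normalize(vector, eps=1e-12):
    """Scale the vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

Because the final vector has unit length, cosine similarity between two such embeddings reduces to a plain dot product.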

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

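# Download from the 🤗 Hub
# The base model uses custom modeling code (BilingualModel), so loading may require trust_remote_code=True
model = SentenceTransformer(
    "Behrni/bilingual-embedding-large-4_batch_10_epoch_all_data_en_unique_split_robustness_43",
    trust_remote_code=True,
)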
# Run inference
sentences = [
    'When administrators are released from their tasks, all personal administration identifiers assigned to them must be blocked. It must be checked which passwords the outgoing employees still know. Such passwords must be changed. Furthermore, it must be checked whether the outgoing employees have been appointed as contact persons to third parties, e.g. in contracts or as an admin-C entry at Internet-Domains. In this case, new contact persons must be identified and the interested third parties informed. Users of the affected IT systems and applications must be informed that the previous administrator has left.',
    'The full life cycle of identities shall be managed.',
    'A.5.16',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
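
For semantic search, the similarity scores are typically used to rank candidate texts against a query. The sketch below shows that ranking step in pure Python on toy vectors; `cosine_similarity` and `rank_candidates` are hypothetical helpers for illustration, and `model.similarity` already performs the scoring part for whole matrices.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_vector, candidate_vectors, top_k=3):
    """Return (index, score) pairs for the top_k most similar candidates."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(candidate_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of a requirement and three controls.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = rank_candidates(query, candidates)
print(ranking)
```

With real embeddings, one row of `embeddings` would serve as the query and the remaining rows as candidates, for example after converting them with `.tolist()`.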

Evaluation

Metrics

Triplet

  • Dataset: bilingual-embedding-large-4_batch_10_epoch_all_data_en_unique_split_robustness_43_eval
  • Evaluated with TripletEvaluator
    Metric              Value
    cosine_accuracy     0.9072
    dot_accuracy        0.0677
    manhattan_accuracy  0.9083
    euclidean_accuracy  0.9072
    max_accuracy        0.9083
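
A triplet counts as correct when the anchor is closer to its positive than to its negative under the chosen distance. The sketch below illustrates the cosine variant in pure Python on toy vectors; it is a simplified illustration of the metric rather than the TripletEvaluator code, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_distance(anchor, positive) < cosine_distance(anchor, negative)
    )
    return correct / len(triplets)


# Two toy triplets: the first is ranked correctly, the second is not.
toy_triplets = [
    ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.2]),
]
accuracy = triplet_accuracy(toy_triplets)
print(accuracy)
```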

Training Details

Training Dataset

Unnamed Dataset

  • Size: 3,435 training samples
  • Columns: anchor, positive, ISO_ID, and negative
  • Approximate statistics based on the first 1000 samples:
    column    type    min (tokens)  mean (tokens)  max (tokens)
    anchor    string  12            85.89          316
    positive  string  10            26.39          207
    ISO_ID    string  3             4.99           5
    negative  string  10            25.4           207
  • Samples:
    Sample 1
    anchor: The Cloud Service Provider applies appropriate measures to check the cloud service for vulnerabilities which might have been integrated into the cloud service during the software development process.
        The procedures for identifying such vulnerabilities are part of the software development process and, depending on a risk assessment, include the following activities:
        • Static Application Security Testing;
        • Dynamic Application Security Testing;
        • Code reviews by the Cloud Service Provider's subject matter experts; and
        • Obtaining information about confirmed vulnerabilities in software libraries provided by third parties and used in their own cloud service.
        The severity of identified vulnerabilities is assessed according to defined criteria and measures are taken to immediately eliminate or mitigate them.
    positive: Information about technical vulnerabilities of information systems in use shall be obtained, the organization’s exposure to such vulnerabilities shall be evaluated and appropriate measures shall be taken.
    ISO_ID: A.8.8
    negative: Backup copies of information, software and systems shall be maintained and regularly tested in accordance with the agreed topic-specific policy on backup.

    Sample 2
    anchor: Policies and instructions for planning and conducting audits are documented, communicated and made available in accordance with SP-01 and address the following aspects:
        • Restriction to read-only access to system components in accordance with the agreed audit plan and as necessary to perform the activities;
        • Activities that may result in malfunctions to the cloud service or breaches of contractual requirements are performed during scheduled maintenance windows or outside peak periods; and
        • Logging and monitoring of activities.
    positive: Audit tests and other assurance activities involving assessment of operational systems shall be planned and agreed between the tester and appropriate management.
    ISO_ID: A.8.34
    negative: The organization shall provide a mechanism for personnel to report observed or suspected information security events through appropriate channels in a timely manner.

    Sample 3
    anchor: System components in the Cloud Service Provider's area of responsibility that are used to provide the cloud service, authenticate users of the Cloud Service Provider's internal and external employees as well as system components that are involved in the Cloud Service Provider's automated authorisation processes. Access to the production environment requires two-factor or multi-factor authentication. Within the production environment, user authentication takes place through passwords, digitally signed certificates or procedures that achieve at least an equivalent level of security. If digitally signed certificates are used, administration is carried out in accordance with the Guideline for Key Management (cf. CRY-01). The password requirements are derived from a risk assessment and documented, communicated and provided in a password policy according to SP-01. Compliance with the requirements is enforced by the configuration of the system components, as far as technically possible.
    positive: Allocation and management of authentication information shall be controlled by a management process, including advising personnel on appropriate handling of authentication information.
    ISO_ID: A.5.17
    negative: Networks and network devices shall be secured, managed and controlled to protect information in systems and applications.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
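
With these parameters, each anchor is scored against every candidate in the batch using cosine similarity multiplied by the scale (20.0), and the loss is the cross-entropy of picking the anchor's own positive. The sketch below is a pure-Python illustration of that objective on toy vectors, not the library's implementation; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i should select positive i.

    Every other positive in the batch, plus any explicit negatives,
    acts as a negative candidate for anchor i.
    """
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]     # each anchor aligned with its positive
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped between anchors

loss_matched = multiple_negatives_ranking_loss(anchors, matched)
loss_mismatched = multiple_negatives_ranking_loss(anchors, mismatched)
print(loss_matched, loss_mismatched)
```

A larger scale sharpens the softmax, so a wrongly ranked candidate is penalized more heavily.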
    

Evaluation Dataset

Unnamed Dataset

  • Size: 916 evaluation samples
  • Columns: anchor, positive, ISO_ID, and negative
  • Approximate statistics based on the first 916 samples:
    column    type    min (tokens)  mean (tokens)  max (tokens)
    anchor    string  14            83.7           213
    positive  string  10            35.48          309
    ISO_ID    string  3             4.9            5
    negative  string  10            34.86          309
  • Samples:
    Sample 1
    anchor: The Cloud Service Provider provides a training program for regular, target group-oriented security training and awareness for internal and external employees on standards and methods of secure software development and provision as well as on how to use the tools used for this purpose. The program is regularly reviewed and updated with regard to the applicable policies and instructions, the assigned roles and responsibilities and the tools used.
    positive: The organization shall:
        a) determine the necessary competence of person(s) doing work under its control that affects its information security performance;
        b) ensure that these persons are competent on the basis of appropriate education, training, or experience;
        c) where applicable, take actions to acquire the necessary competence, and evaluate the effectiveness of the actions taken; and
        d) retain appropriate documented information as evidence of competence.
        NOTE Applicable actions can include, for example: the provision of training to, the mentoring of, or the re-assignment of current employees; or the hiring or contracting of competent persons.
    ISO_ID: 7.2
    negative: Knowledge gained from information security incidents shall be used to strengthen and improve the information security controls.

    Sample 2
    anchor: The Cloud Service Provider provides a training program for regular, target group-oriented security training and awareness for internal and external employees on standards and methods of secure software development and provision as well as on how to use the tools used for this purpose. The program is regularly reviewed and updated with regard to the applicable policies and instructions, the assigned roles and responsibilities and the tools used.
    positive: Personnel of the organization and relevant interested parties shall receive appropriate information security awareness, education and training and regular updates of the organization's information security policy, topic-specific policies and procedures, as relevant for their job function.
    ISO_ID: A.6.3
    negative: Rules for the effective use of cryptography, including cryptographic key management, shall be defined and implemented.

    Sample 3
    anchor: The Cloud Service Provider provides a training program for regular, target group-oriented security training and awareness for internal and external employees on standards and methods of secure software development and provision as well as on how to use the tools used for this purpose. The program is regularly reviewed and updated with regard to the applicable policies and instructions, the assigned roles and responsibilities and the tools used.
    positive: Changes to information processing facilities and information systems shall be subject to change management procedures.
    ISO_ID: A.8.32
    negative: Security perimeters shall be defined and used to protect areas that contain information and other associated assets.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • num_train_epochs: 10
  • warmup_ratio: 0.1
  • bf16: True
  • ddp_find_unused_parameters: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: True
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
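
Taken together, learning_rate 5e-05, lr_scheduler_type linear, and warmup_ratio 0.1 describe a learning rate that ramps up linearly and then decays linearly to zero. The sketch below illustrates that shape in pure Python. It assumes 4290 total optimizer steps (the final step in the training logs), and the function name is hypothetical; it is not the scheduler code used by the trainer.

```python
def linear_schedule_lr(step, total_steps=4290, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at a given optimizer step for linear warmup and decay."""
    # Assumption: warmup covers the first warmup_ratio share of all steps.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 215, 429, 2145, 4290):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak learning rate is reached at step 429, which coincides with the end of the first epoch.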

Training Logs

Cosine accuracy is measured on bilingual-embedding-large-4_batch_10_epoch_all_data_en_unique_split_robustness_43_eval.

    Epoch  Step  Training Loss  Evaluation Loss  Cosine Accuracy
    1.0    429   1.4247         1.1282           0.8657
    2.0    858   0.9088         0.9658           0.8908
    3.0    1287  0.6702         0.9134           0.8963
    4.0    1716  0.5573         0.9099           0.9007
    5.0    2145  0.48           0.8961           0.9116
    6.0    2574  0.4304         0.9118           0.9061
    7.0    3003  0.3893         0.9184           0.9072
    8.0    3432  0.3781         0.9245           0.9072
    9.0    3861  0.366          0.9238           0.9061
    10.0   4290  0.3609         0.9250           0.9072
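
The log can be read programmatically to find the epoch with the lowest evaluation-set loss and the highest cosine accuracy. The sketch below copies the values from the table into a list and also checks the steps-per-epoch arithmetic. The device count of 2 is an assumption inferred from 429 steps per epoch at a per-device batch size of 4; the card does not state it.

```python
# Rows: (epoch, step, training_loss, evaluation_loss, cosine_accuracy)
training_log = [
    (1, 429, 1.4247, 1.1282, 0.8657),
    (2, 858, 0.9088, 0.9658, 0.8908),
    (3, 1287, 0.6702, 0.9134, 0.8963),
    (4, 1716, 0.5573, 0.9099, 0.9007),
    (5, 2145, 0.48, 0.8961, 0.9116),
    (6, 2574, 0.4304, 0.9118, 0.9061),
    (7, 3003, 0.3893, 0.9184, 0.9072),
    (8, 3432, 0.3781, 0.9245, 0.9072),
    (9, 3861, 0.366, 0.9238, 0.9061),
    (10, 4290, 0.3609, 0.9250, 0.9072),
]

# Epoch with the lowest evaluation loss and with the highest cosine accuracy.
best_by_loss = min(training_log, key=lambda row: row[3])
best_by_accuracy = max(training_log, key=lambda row: row[4])
print("lowest evaluation loss at epoch", best_by_loss[0])
print("highest cosine accuracy at epoch", best_by_accuracy[0])

# Steps per epoch with dataloader_drop_last=True (incomplete batch dropped).
train_samples = 3435
per_device_batch_size = 4
assumed_devices = 2  # assumption, not stated in the card
steps_per_epoch = train_samples // (per_device_batch_size * assumed_devices)
print("steps per epoch:", steps_per_epoch)
```

Training loss keeps falling after epoch 5 while evaluation loss rises, which suggests mild overfitting in later epochs. Since load_best_model_at_end is False, the reported final metrics match the epoch 10 row rather than the best epoch.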

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.1.0
  • Transformers: 4.45.1
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.20.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 560M params (Safetensors, tensor type BF16)

Model tree for Behrni/bilingual-embedding-large-4_batch_10_epoch_all_data_en_unique_split_robustness_43

Finetuned from Lajavaness/bilingual-embedding-large.

Evaluation results

  • Dataset: bilingual-embedding-large-4_batch_10_epoch_all_data_en_unique_split_robustness_43_eval (all values self-reported)
  • Cosine Accuracy: 0.907
  • Dot Accuracy: 0.068
  • Manhattan Accuracy: 0.908
  • Euclidean Accuracy: 0.907
  • Max Accuracy: 0.908