SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
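
The architecture above takes the [CLS] token vector as the sentence embedding (pooling_mode_cls_token is True) and then L2-normalizes it. The following is a minimal pure-Python sketch of those two steps, not the library's implementation; the toy 4-dimensional token vectors and the helper names `cls_pool` and `l2_normalize` are illustrative assumptions.

```python
import math


def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy input: 3 tokens with 4 dimensions each.
# The real model produces up to 512 token vectors of 768 dimensions.
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output has unit length, the cosine similarity of two embeddings reduces to their dot product.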

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v7")
# Run inference
sentences = [
    '\n\nOblige with Data Localization Requirements:\n\nOblige with Product Safety and Certifications Requirements:\n\nFulfill Content Monitoring Requirements:\n\nChina’s Cybersecurity Law (the “CSL”), which went into effect on June 1st, 2017, applies to the construction, operation, maintenance, and use of information networks, and the supervision and administration of cybersecurity in China. The CSL provides guidelines on cybersecurity requirements for safeguarding Chinese cyberspace. The law protects the legal interests and rights of organizations as well as individuals in China. It also promotes the secure development of technology and the digitization of the economy in China. Following entities come under the application scope of the CSL:\n\n**Network Operators:\n\n** It refers to the owners and administrators of networks and network service providers, and could be interpreted to include any companies providing services, or running their business through a computer network in China.\n\n**Critical Information Infrastructure Operators (CIIOs):\n\n** It refers to operators of critical information infrastructure in important industries and sectors (such as information service, public service, and e\n\ngovernment) and other information infrastructure that, if leaked, may severely threaten the national security, national economy, people’s livelihood, and public interests.\n\n**Network Products and Services Providers:\n\n** Organizations that provide information through networks or provide services to obtain information, including users, network services providers which provide network tools, devices, media, etc.\n\nCompliance with the CSL is not straightforward since CSL has several ambiguities and complicated obligations for network operators and CIIOs. '
    'Additional laws and guidelines will also be considered concerning the CSL compliance, including guidelines concerning the security assessment of cross- border transfers of personal information and important data, Data Security Law (DSL), and recently promulgated Personal Information Protection Law (PIPL).\n\nWe have prepared the following compliance checklist for the covered entities to ensure compliance with the CSL. Please note that this is not an exhaustive compliance list. For a detailed overview of the CSL, please refer to our article on What is China’s Cybersecurity Law?\n\n## 1\\. Fulfill Network Operations Security Requirements:\n\n## A. Requirements for network operators:\n\nNetwork operators must adopt the following security measures to prevent network interference, damage, or unauthorized access, and prevent network data from leakage, theft, or alteration:\n\nEstablish internal, \n## 5\\. Oblige with Product Safety and Certifications Requirements:\n\n## A. Requirements for Network Products and Services Providers:\n\nCybersecurity product manufacturers, security service suppliers, and other organizations that provide services through networks should oblige with the following requirements:\n\nNetwork products and services providers must not set up malicious programs.\n\nUpon discovering a security flaw, vulnerability, or another risk in their product or service, they must take remedial action immediately, inform users and report the issue to the relevant departments.\n\nNetwork product and service providers are required to conduct security maintenance for their products and services.\n\n## B. Requirements for CIIOs:\n\nCIIOs must, when procuring network products and services that may impact national security, submit the products and services to CAC and the State Council departments for a review for national security purposes. '
    'Critical network equipment and special cybersecurity products can only be sold or provided after being certified by a qualified establishment, and are in compliance with national standards.\n\n## 6\\. Fulfill Content Monitoring Requirements:\n\nAccording to Article 47 of the CSL, network operators are required to monitor the information released by their users for information that is “prohibited from being published or transmitted by laws or administrative regulations. If such information is discovered, network operators must cease the transmission of information, remove the information, keep records, and report any unlawful content to relevant authorities. Securiti helps organizations automate their privacy management operations using artificial intelligence and robotic automation. Request a demo and start your CSL compliance process today.\n\n## Join Our Newsletter\n\nGet all the latest information, law updates and more delivered to your inbox\n\n### Share\n\nCopy\n\n55\n\n### More Stories that May Interest You\n\nView More\n\nSeptember 11, 2023\n\n## Securiti named a Leader in the IDC MarketScape for Data Privacy Compliance Software\n\nSecuriti has just been recognized as a Leader in the “IDC MarketScape: Worldwide Data Privacy Compliance Software 2023 Vendor Assessment” report. This makes us...\n\nView More\n\nMay 10, 2023\n\n## Privacy\n\nby\n\nDesign and Privacy\n\nby\n\nDefault\n\nPrivacy-by-design and privacy-by-default are two cornerstone concepts of data protection regulatory frameworks. Thus, compliance thereof is an essential legal prerequisite for any entity which...\n\nView More\n\nApril 5,',
    "What security measures must network operators adopt to fulfill content monitoring requirements under China's Cybersecurity Law, and what obligations do network products and services providers and CIIOs have in relation to product safety and certifications?",
    'How does the PDPA in Malaysia protect personal data in commercial transactions and who does it apply to?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
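
Since this model was trained with MatryoshkaLoss at dimensions 768, 512, 256, 128, and 64, its embeddings can plausibly be shortened to one of those sizes at some cost in retrieval quality. The sketch below shows the idea in pure Python, under the assumption that truncation means keeping the leading components and re-normalizing; the toy 4-dimensional vectors and the helper names are hypothetical stand-ins for real model outputs.

```python
import math

# Dimensions listed in this card's MatryoshkaLoss configuration.
MATRYOSHKA_DIMS = (768, 512, 256, 128, 64)


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity; for unit vectors this is the dot product."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in embeddings]
        for u in embeddings
    ]


# Toy 4-dimensional vectors standing in for 768-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 2.0, 2.0],
    [0.0, 1.0, 2.0, -2.0],
    [2.0, 0.0, 0.0, 1.0],
]

# Truncate to 2 dimensions (a real call would use a value from MATRYOSHKA_DIMS).
small = [truncate_and_normalize(e, 2) for e in toy_embeddings]
similarities = cosine_similarity_matrix(small)
print(similarities)
```

Recent Sentence Transformers releases also accept a truncate_dim argument when loading a model; check the installed version's documentation before relying on it.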

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.08
cosine_accuracy@3 0.27
cosine_accuracy@5 0.45
cosine_accuracy@10 0.67
cosine_precision@1 0.08
cosine_precision@3 0.09
cosine_precision@5 0.09
cosine_precision@10 0.067
cosine_recall@1 0.08
cosine_recall@3 0.27
cosine_recall@5 0.45
cosine_recall@10 0.67
cosine_ndcg@10 0.3383
cosine_mrr@10 0.2359
cosine_map@100 0.2441

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.06
cosine_accuracy@3 0.25
cosine_accuracy@5 0.39
cosine_accuracy@10 0.66
cosine_precision@1 0.06
cosine_precision@3 0.0833
cosine_precision@5 0.078
cosine_precision@10 0.066
cosine_recall@1 0.06
cosine_recall@3 0.25
cosine_recall@5 0.39
cosine_recall@10 0.66
cosine_ndcg@10 0.317
cosine_mrr@10 0.2124
cosine_map@100 0.2215

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.05
cosine_accuracy@3 0.24
cosine_accuracy@5 0.38
cosine_accuracy@10 0.6
cosine_precision@1 0.05
cosine_precision@3 0.08
cosine_precision@5 0.076
cosine_precision@10 0.06
cosine_recall@1 0.05
cosine_recall@3 0.24
cosine_recall@5 0.38
cosine_recall@10 0.6
cosine_ndcg@10 0.2931
cosine_mrr@10 0.1985
cosine_map@100 0.2113

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.05
cosine_accuracy@3 0.28
cosine_accuracy@5 0.36
cosine_accuracy@10 0.55
cosine_precision@1 0.05
cosine_precision@3 0.0933
cosine_precision@5 0.072
cosine_precision@10 0.055
cosine_recall@1 0.05
cosine_recall@3 0.28
cosine_recall@5 0.36
cosine_recall@10 0.55
cosine_ndcg@10 0.2783
cosine_mrr@10 0.1938
cosine_map@100 0.2078

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.04
cosine_accuracy@3 0.21
cosine_accuracy@5 0.3
cosine_accuracy@10 0.53
cosine_precision@1 0.04
cosine_precision@3 0.07
cosine_precision@5 0.06
cosine_precision@10 0.053
cosine_recall@1 0.04
cosine_recall@3 0.21
cosine_recall@5 0.3
cosine_recall@10 0.53
cosine_ndcg@10 0.2516
cosine_mrr@10 0.1667
cosine_map@100 0.1772
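
In the tables above, recall@k always equals accuracy@k and precision@k equals accuracy@k divided by k (for example 0.27 / 3 = 0.09), which is consistent with each query having exactly one relevant passage. Under that assumption, the metrics can be sketched as follows; the function name `ir_metrics` and the example ranks are hypothetical and are not taken from the evaluation data.

```python
def ir_metrics(ranks, k):
    """Compute retrieval metrics at cutoff k.

    ranks: for each query, the 1-based rank of its single relevant passage,
           or None if the passage was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    return {
        "accuracy": accuracy,       # share of queries with a hit in the top k
        "precision": accuracy / k,  # one relevant item out of k retrieved
        "recall": accuracy,         # one relevant item per query in total
        "mrr": sum(1.0 / r for r, hit in zip(ranks, hits) if hit) / n,
    }


# Hypothetical ranks for five queries.
example_ranks = [1, 3, 2, None, 7]
metrics_at_3 = ir_metrics(example_ranks, k=3)
print(metrics_at_3)
```

With a single relevant passage per query, low accuracy@1 alongside much higher accuracy@10 means the right passage is usually retrieved but rarely ranked first.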

Training Details

Training Dataset

Unnamed Dataset

  • Size: 900 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive (string): min 159 tokens, mean 446.78 tokens, max 512 tokens
    • anchor (string): min 9 tokens, mean 22.04 tokens, max 82 tokens
  • Samples:
    positive anchor
    issues related to the organization's privacy officers, exemption from consent requirements, biometric information registration, and breach reports. The next two stages will come into effect in September 2023 and September 2024, respectively.

    ### Hong Kong

    #### Hong Kong Personal Data (Privacy) Ordinance (PDPO)

    Effective Date : Since 1995 Region : APAC (Asia-Pacific)

    The PDPO is the primary legislation in Hong Kong which was enacted to protect the privacy of individuals’ personal data, and regulate the collection, holding, processing, disclosure, or use of personal data by the organizations. The PDPO applies to private and public sector organizations that process, use, hold, or collect personal data. It covers any organization that deals with the collection and processing of personal data irrespective of the location of processing, provided that the personal data is controlled by the data user based in Hong Kong.

    Resources*

    :

    Hong Kong PDPO Overview

    ### Ireland

    #### Irish Data Protection Act (DPA)

    Effective Date : May 24, 2018 Region : EMEA (Europe, the Middle East and Africa)

    The Irish DPA implements the GDPR into the national law by incorporating most of the provisions of the GDPR with limited additions and deletions. It contains several provisions restricting data subjects’ rights that they generally have under the GDPR, for example, where restrictions are necessary for the enforcement of civil law claims.

    Resources*

    :

    Irish DPA Overview

    Irish Cookie Guidance

    ### Japan

    #### Japan’s Act on the Protection of Personal Information (APPI)

    Effective Date (Amended APPI) : April 01, 2022 Region : APAC (Asia-Pacific)

    Japan’s APPI regulates personal related information and applies to any Personal Information Controller (the “PIC''), that is a person or entity providing personal related information for use in business in Japan. The APPI also applies to the foreign PICs which handle personal information of data subjects (“principals”) in Japan for the purpose of supplying goods or services to those persons.The act ensures the individual’s rights to privacy and also the legal use of personal data for economic development.

    Resources*

    :

    Japan APPI Overview

    ### New Zealand

    #### New Zealand
    What are the regulations regarding breach reports in New Zealand?
    data. Finally, as previously mentioned, consumers can opt-out of the collection of their sensitive personal data.

    Means to submit DSR request:

    A consumer may exercise a right by submitting an authenticated request to a controller, by means prescribed by the controller, specifying the right the consumer intends to exercise. In the instance of processing personal data concerning a child, the parent or legal guardian of the child can exercise a right on the child's behalf. In the case of processing personal data concerning a consumer subject to guardianship, conservatorship, or other protective arrangements under Title 75, Chapter 5, Protection of Persons Under Disability and Their Property, the guardian or the conservator of the consumer shall exercise a right on the consumer's behalf.

    Time period to fulfill DSR request

    : A controller shall comply with a consumer's request to exercise a right within 45 days after the day on which a controller had received that particular request. The controller then shall take action on the consumer's request; and inform the consumer of any action taken on the consumer's request.

    Extension in the time period:

    An additional 45 days can be granted if it is reasonably necessary to comply with the request, keeping in mind the complexity of the request or the volume of the requests received by the controller. In such cases, the controller is to inform the consumer of the extension and provide reasons for the extension.

    Charges:

    Controllers are not allowed to charge a fee for responding to a request under the law apart from certain situations. If the request is a consumer's second or subsequent request within the same 12

    month period, a controller may charge a reasonable fee. A controller may also charge a reasonable fee to cover the administrative costs of complying with a request or refuse to act on a request if:

    the request is excessive, repetitive, technically infeasible as per the law; or

    the controller considers that the primary goal for the submitted request was something other than exercising a right; or

    the request disrupts or imposes an undue burden on the resources of the controller’s business.

    Appeal against refusal:

    The data controller may choose to not to take action on a consumer’s DSR request. It must provide the consumer the reasons for which it did not take the action within the 45 days time period of receiving the DSR request. The data controller may also choose to not honor the request
    What is the time frame for a controller to fulfill a consumer's request to exercise a right, and what can extend this period?
    or use of personal data. This is the same as the term 'data controller.'

    ## Data Processor

    Data Processor is a person or entity who processes personal data on behalf of another person or entity (a data user) instead of for his/her purpose(s).

    ## Consent

    Consent is not a prerequisite for collecting personal data unless the personal data is used for a new purpose or for direct marketing purposes. Where consent is required, consent means to express and voluntary consent.

    ## Data Subjects' Rights under the PDPO:

    The PDPO prescribes the following rights for the data subjects;

    DPP 6 provides data subjects with the right to request access to and correction of their personal data. A data user should give reasons when refusing a data subject’s request to access or correction of his/her personal data.

    Data subjects have the right to be informed by data user(s) regarding the holding of their personal data.

    There is no explicit right to erasure available under the PDPO, however, data subjects can request the data user to delete his/her personal data that is no longer necessary for the processing. Also, data users are not allowed to retain personal data longer than necessary.

    Under the PDPO, there is no right to object to processing (including profiling) available, but data subjects may opt

    out from direct marketing activities.

    ## Who needs to comply with the PDPO?

    The PDPO applies to private and public sector organizations that process, use, hold, or collect personal data. It covers any organization that deals with the collection and processing of personal data irrespective of the location of processing provided that the personal data is controlled by the data user based in Hong Kong.

    The PDPO provides the following exemptions for the processing of personal data in Part VIII;

    specified public or judicial interests

    domestic or recreational purposes, or for

    employment purposes.

    The PDPO does not directly regulate data processors; therefore, they do not directly come under the application scope of the PDPO. However, data users are required to, by contractual or other means, ensure that their data processors meet the applicable requirements of the PDPO.

    ## Organizations' obligations under the PDPO:

    PDPO does not explicitly state accountability principles and other privacy management related measures; however, the PCPD recommends
    What rights do data subjects have under the PDPO regarding the right to object to processing, and what are the limitations?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
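
The configuration above wraps MultipleNegativesRankingLoss in MatryoshkaLoss: the base loss is evaluated on embeddings truncated to each listed dimension and the results are combined using the weights. The following is a simplified pure-Python sketch of that idea, not the library code. It assumes cosine similarity with a scale factor of 20 as the base loss settings, and it uses toy 4-dimensional vectors with dims (4, 2) in place of (768, ..., 64).

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: anchor i should score positive i above all others.

    Implemented as mean cross-entropy over scaled cosine similarities.
    The scale of 20.0 is an assumption made for this sketch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated embeddings."""
    return sum(
        w * mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )


# Toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.1, 0.0, 0.2], [0.1, 1.0, 0.3, 0.0]]
positives = [[0.9, 0.2, 0.1, 0.1], [0.2, 0.8, 0.2, 0.1]]

loss = matryoshka_loss(anchors, positives, dims=(4, 2), weights=(1, 1))
print(loss)
```

Because every truncated prefix has to rank the correct positive first, the leading dimensions are pushed to carry most of the useful signal.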
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 2
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_128_cosine_map@100 dim_256_cosine_map@100 dim_512_cosine_map@100 dim_64_cosine_map@100 dim_768_cosine_map@100
0.6897 10 8.029 - - - - -
0.9655 14 - 0.2004 0.2241 0.2170 0.1726 0.2279
1.3793 20 5.6389 - - - - -
1.931 28 - 0.2078 0.2113 0.2215 0.1772 0.2441
  • The saved checkpoint is the final row (step 28); its map@100 values match the evaluation tables above.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

  • Model size: 109M params
  • Tensor type: F32