SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-base-en-v1.5
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Supported Modality: Text

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'cls', 'include_prompt': True})
  (2): Normalize({})
)
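
The three modules above form a pipeline: the Transformer produces one vector per token, Pooling in cls mode keeps only the first token's vector, and Normalize rescales that vector to unit length. The following is a minimal pure-Python sketch of the last two steps, not the library's implementation. The token values are made up and use 3 dimensions instead of 768, and the names cls_pool, l2_normalize, and dot are illustrative.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS] for BERT-style models)."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical token embeddings for two inputs (one row per token).
tokens_a = [[3.0, 4.0, 0.0], [9.0, 9.0, 9.0]]
tokens_b = [[0.0, 5.0, 0.0], [1.0, 2.0, 3.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# For unit-length vectors the dot product equals the cosine similarity.
print(emb_a)
print(dot(emb_a, emb_b))
```

Because the outputs are already normalized, a plain dot product gives the same score as the cosine similarity function listed in the model description.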

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("nancy-noubou/bge-base-iso-clauses-v1")
# Run inference
sentences = [
    'Represent this sentence for searching relevant passages: The organization shall maintain documented information to the extent necessary to have confidence that the processes are being carried out as planned and to demonstrate the conformity of products and services to requirements. The organization shall determine: a) what documented information is necessary for the effectiveness of the quality management system; b) the documented information to be retained to provide evidence of conformity; and c) the period for which it shall be retained.',
    'Management Review Notes: Date: 2023-12-01. Participants: Marc Petit (Ops Manager), Sarah Mendez (QA Lead), David Huang (Engineering Supervisor). Agenda Items: 1) Review of Q3 performance metrics related to the Lumen Hull Sensor Edge. 2) Discussions on improving incident response times. 3) Updates on supplier performance and feedback. Decisions Made: - Metrics showed room for improvement in sensor accuracy under varying conditions; action required from the engineering team to address findings before the next review. - A timeline to implement the proposed incident response targets was established, with updates due by the next quarterly meeting on 2024-01-15. - Agreed to continue monitoring Supplier A’s performance and reassess in the Q1 review. Next Meeting: Scheduled for 2024-01-15 to discuss progress and metrics.',
    'The calibration records (Document ID: CR-2023-045) for the Surgical Robot R indicate that the robotic arm calibration was last performed on August 15, 2022, making it 14 months overdue for recalibration. The last recorded precision test showed an average positioning error of 1.5 mm, which is above the acceptable threshold of 0.5 mm for surgical applications. These records were compiled by Mark Johnson from the Calibration Department. The outdated calibration status of the robotic arm poses a ris',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4010, 0.1300],
#         [0.4010, 1.0000, 0.1790],
#         [0.1300, 0.1790, 1.0000]])
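
In the example above, the first input is the query (a clause carrying the retrieval prefix) and the other two are candidate documents, so row 0 of the similarity matrix holds the query's score against each document. The following is a minimal sketch of turning that row into a ranking, using the scores printed above. QUERY_PREFIX is copied from the example query; build_query and rank_documents are hypothetical helper names, not part of the library.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def build_query(clause_text):
    """Prepend the retrieval prefix used by the example queries in this card."""
    return QUERY_PREFIX + clause_text

def rank_documents(query_scores):
    """Return (document index, score) pairs sorted by descending similarity."""
    return sorted(enumerate(query_scores), key=lambda pair: pair[1], reverse=True)

# Row 0 of the matrix printed above, without the query's self-similarity.
query_scores = [0.4010, 0.1300]
ranking = rank_documents(query_scores)
for doc_index, score in ranking:
    print(f"document {doc_index}: {score:.4f}")
```

Only the queries in this card carry the prefix; the documents are encoded as-is.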

Training Details

Training Dataset

Unnamed Dataset

Size: 23,514 training samples
Columns: sentence_0, sentence_1, and sentence_2

Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1	sentence_2
type	string	string	string
min	19 tokens	89 tokens	4 tokens
mean	78.93 tokens	165.83 tokens	146.79 tokens
max	512 tokens	305 tokens	306 tokens

Samples:

sentence_0	sentence_1	sentence_2
Represent this sentence for searching relevant passages: The manufacturer shall determine measures that are appropriate for reducing the risks to an acceptable level. The manufacturer shall use one or more of the following options in the priority order listed: a) inherently safe design and manufacture;	In the Design Review Document (ID: DRM-2023-04), dated March 15, 2023, the engineering team conducted a comprehensive risk analysis for the Cold Chain Monitor N. During the review, it was determined that utilizing a temperature-resistant housing made from ABS plastic (with a thermal resistance rating of -40°C to 70°C) significantly reduces the risk of equipment failure in extreme conditions. The housing design was verified through thermal cycling tests, showing a consistent performance with an i	In the HR training schedule released on August 20, 2023 (Doc ID: HR-TRAIN-CCM-2023-08), the focus is on the onboarding process for new employees involved in the Cold Chain Monitor N project. The document outlines a comprehensive two-week orientation that covers company policies, team structure, and basic operational procedures. Notably, it includes a session on the importance of maintaining strict temperature controls during product handling, which is vital for the device's effectiveness. Althou
Represent this sentence for searching relevant passages: The organization shall determine and manage the knowledge necessary for the operation of its processes and to achieve conformity of products and services. This knowledge shall be maintained and made available to the extent necessary.	Date: 2024-02-20. Attendees: Senior Management Team including Marc Petit (Ops Manager), Sarah Mendez (Quality Manager), and Emma Li (Regulatory Affairs). Agenda Items: 1) Review quarterly performance metrics. 2) Discuss customer feedback findings. 3) Evaluate training program effectiveness. Notes: 1) Performance metrics show a 20% increase in production efficiency; actions taken in the prior quarter are yielding results. 2) Customer feedback indicated areas for improvement particularly in user instructions; an initiative to revise manuals was approved. 3) Training effectiveness was acknowledged, and it was decided to implement bi-annual refresher sessions. Decisions Made: 1) Marcy to lead the manual revision initiative with expected completion by April 30, 2024. 2) Sarah to outline a plan for the bi-annual refresher sessions, targeting early May 2024 for the first session.	Title: Risk Assessment for Quantum Diagnostic Imager Elite. Date: 2024-03-01. Conducted by: Marc Petit, Operations Manager, and Sarah Mendez, Quality Manager. Identified Risks: 1) Risk of imaging inaccuracy due to equipment malfunction. 2) Supplier dependency affecting materials quality. Mitigation Actions: For the first risk, a comprehensive calibration schedule has been established, with reminders set in the system to ensure timely execution. Additionally, the training program for technicians has been enhanced to include troubleshooting for common malfunctions. For the second risk, diversifying suppliers has been outlined as a strategy, with an evaluation of potential candidates already underway. Monitoring plans include quarterly reviews of equipment performance and supplier assessments, documented in the Risk Management Log (RML-2024).
Represent this sentence for searching relevant passages: The results of this review shall be recorded in the management file. Compliance is checked by inspection of the evaluation of overall residual risk. The manufacturer shall evaluate the overall residual risk posed by the medical device, taking into account the contributions of all risk control measures that have been implemented and verified, in relation to the criteria for acceptability of the overall residual risk defined in the risk management plan. If the overall residual risk is judged acceptable, the manufacturer shall inform users of significant residual risks and shall include the necessary information in the documentation in order to disclose those residual risks.	Calibration records (ID: CCMO-CAL-2023-012) for the Cold Chain Monitor O were last updated on May 10, 2023. The temperature sensors were calibrated using NIST-traceable standards, with a measured accuracy of ±0.2°C. This process was conducted by the Calibration Specialist, Emily White, and included the verification of 10 sensors across different units. Each calibration was documented with specific reference to the calibration equipment used, which is regularly maintained and validated against pr	An internal draft titled 'Cold Chain Monitor O Risk Management Procedure' (Document ID: DRAFT-2023-009) was circulated on April 5, 2023. This document outlines a proposed framework for identifying risks associated with the product's performance in differing environmental conditions. While the draft emphasizes the significance of risk evaluation, it lacks specific metrics or a step-by-step process for implementation. The final approval of this procedure is pending, with no set date for completion

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false,
    "directions": [
        "query_to_doc"
    ],
    "partition_mode": "joint",
    "hardness_mode": null,
    "hardness_strength": 0.0
}
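
To make the parameters above concrete, here is a minimal pure-Python sketch of the core idea behind this loss: for each query, the scaled cosine similarities to all candidate documents in the batch are treated as logits, and cross-entropy pushes the query's own positive to the top. It assumes that sentence_0 is the query, sentence_1 the positive, and sentence_2 a hard negative, which this card does not state explicitly. It covers only the query_to_doc direction with scale 20.0 and does not model gather_across_devices, partition_mode, or the hardness options. The names cos_sim and mnr_loss are illustrative, and this is not the library's implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where positives[i] is the correct candidate for anchors[i].

    Every other positive and every hard negative in the batch acts as a negative.
    """
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy 2-dimensional embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the other's positive

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned)  # small: the correct candidate already scores highest
print(loss_swapped)  # large: the correct candidate scores lowest
```

A larger batch supplies more in-batch negatives per query, which is why batch size matters for this loss beyond throughput.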

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
multi_dataset_batch_sampler: round_robin

All Hyperparameters

do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: None
warmup_ratio: None
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
enable_jit_checkpoint: False
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
use_cpu: False
seed: 42
data_seed: None
bf16: False
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: -1
ddp_backend: None
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
group_by_length: False
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_for_metrics: []
eval_do_concat_batches: True
auto_find_batch_size: False
full_determinism: False
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
use_cache: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}
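
With lr_scheduler_type: linear, warmup_steps: 0, and learning_rate: 5e-05, the learning rate is expected to fall in a straight line from 5e-05 to zero over the run. The following is a rough sketch of that schedule, not the trainer's code. TOTAL_STEPS is an assumption: it is derived from 23,514 samples at batch size 16 on a single device, and the device count is not stated in this card. The function name linear_lr is illustrative.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 1470  # assumption: ceil(23514 / 16) optimizer steps in one epoch

for step in (0, 500, 1000, TOTAL_STEPS):
    print(step, linear_lr(step, TOTAL_STEPS))
```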

Training Logs

Epoch	Step	Training Loss
0.3401	500	2.6430
0.6803	1000	2.2176
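
The Epoch column can be cross-checked against the dataset size and batch size reported earlier. The short calculation below assumes a single device (the device count is not stated in this card) and uses gradient_accumulation_steps: 1 and dataloader_drop_last: False from the hyperparameter list, so the final partial batch counts as a step.

```python
import math

TRAIN_SAMPLES = 23_514  # from "Training Dataset"
BATCH_SIZE = 16         # per_device_train_batch_size
NUM_DEVICES = 1         # assumption: not stated in this card

# The last partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / (BATCH_SIZE * NUM_DEVICES))
print(steps_per_epoch)

for step in (500, 1000):
    print(step, round(step / steps_per_epoch, 4))
```

Under that assumption one epoch is 1,470 steps, and steps 500 and 1000 fall at epochs 0.3401 and 0.6803, matching the log.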

Training Time

Training: 54.4 minutes

Framework Versions

Python: 3.12.13
Sentence Transformers: 5.4.0
Transformers: 5.0.0
PyTorch: 2.10.0+cu128
Accelerate: 1.13.0
Datasets: 4.8.5
Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}