bge-large-en-v1.5-ISO-27001

This is a fine-tuned embedding model of bge-large-en-v1.5. It was fine-tuned on a dataset based on an ISO 27001 text corpus consisting of text chunks (1024 characters) and associated questions. A total of 2.000 chunk and question pairs were generated. The fine-tuning process is specialized on an Information Retrieval task in which the generated questions are used to find the relevant chunks. The effectiveness of the model is evaluated on whether the correct chunk was retrieved, and the loss is calculated with the multiple negative ranking loss.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bge-large-en-v1.5-ISO-27001')
embeddings = model.encode(sentences)
print(embeddings)

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 200 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}

Parameters of the fit()-Method:

{
    "epochs": 5,
    "evaluation_steps": 50,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Citing & Authors

Based on https://huggingface.co/BAAI/bge-large-en-v1.5 from Xiao et al. (2023) (C-Pack: Packaged Resources To Advance General Chinese Embedding)

Basti8499
/

bge-large-en-v1.5-ISO-27001

bge-large-en-v1.5-ISO-27001

Usage (Sentence-Transformers)

Training

Full Model Architecture

Citing & Authors

Space using Basti8499/bge-large-en-v1.5-ISO-27001 1