bge-large-en-v1.5-ISO-27001

This model is a fine-tuned version of the bge-large-en-v1.5 embedding model. It was fine-tuned on a dataset derived from an ISO 27001 text corpus, consisting of text chunks (1024 characters each) paired with generated questions. In total, 2,000 chunk-question pairs were generated. Fine-tuning targets an information retrieval task in which each generated question is used to find its relevant chunk. The model is evaluated on whether the correct chunk is retrieved, and training uses the Multiple Negatives Ranking Loss.
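The evaluation criterion described above (was the correct chunk retrieved?) can be illustrated with a small hit-rate computation. This is a minimal sketch, not the evaluator used for this model: the function name `hit_rate_at_k` and the toy similarity values are assumptions made for illustration.

```python
import numpy as np

def hit_rate_at_k(similarity: np.ndarray, relevant_idx: np.ndarray, k: int) -> float:
    """Fraction of questions whose relevant chunk is among the k best-scoring chunks."""
    # Indices of the k highest-scoring chunks for every question (one row per question).
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A question counts as a hit if its relevant chunk index appears in its top k.
    hits = (top_k == relevant_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Toy scores: 3 questions (rows) x 4 chunks (columns). Values are made up.
similarity = np.array([
    [0.9, 0.1, 0.3, 0.2],  # relevant chunk 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # relevant chunk 1 is ranked second
    [0.7, 0.6, 0.5, 0.1],  # relevant chunk 3 is ranked last
])
relevant_idx = np.array([0, 1, 3])

print(hit_rate_at_k(similarity, relevant_idx, k=1))
print(hit_rate_at_k(similarity, relevant_idx, k=2))
```

With these toy values, one of three questions is a hit at k=1 and two of three at k=2.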

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model by its full Hugging Face repository id (owner/name)
model = SentenceTransformer('Basti8499/bge-large-en-v1.5-ISO-27001')
embeddings = model.encode(sentences)
print(embeddings)
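Since the model is tuned for retrieval, a typical use is ranking chunks against a question rather than just printing embeddings. The following is a minimal sketch under assumptions: `retrieve` is a hypothetical helper, and `encoder` is any object with an `encode(list_of_texts)` method that returns one vector per text, such as the `model` loaded above.

```python
import numpy as np

def retrieve(encoder, question, chunks, top_k=3):
    """Return (chunk, score) pairs for the top_k chunks most similar to the question."""
    q = np.asarray(encoder.encode([question]))[0]
    c = np.asarray(encoder.encode(chunks))
    # Cosine similarity. Re-normalizing is harmless if the embeddings
    # already have unit length.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    # Sort chunk indices by descending score and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return [(chunks[i], float(scores[i])) for i in order]

# Example call (assumes `model` was loaded as shown above):
# results = retrieve(model, "your question", ["chunk one", "chunk two"], top_k=1)
```

Passing the encoder in as an argument keeps the ranking logic independent of how the model is loaded.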

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 200 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
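To make the loss parameters concrete: with `cos_sim` and `scale` 20.0, each question in a batch is scored against every chunk in the same batch, and its own chunk is treated as the correct class in a cross-entropy over those scores. With a batch size of 8, each question therefore sees 7 in-batch negatives. The NumPy code below is an illustrative sketch of that computation, not the library's implementation.

```python
import numpy as np

def multiple_negatives_ranking_loss(q_emb, c_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities; chunk i is the positive for question i."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    scores = scale * (q @ c.T)  # shape: (batch, batch)
    # Numerically stable row-wise log-softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each question's own chunk.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: questions and chunks that align perfectly give a loss close to 0.
aligned = np.eye(4)
print(multiple_negatives_ranking_loss(aligned, aligned))
```

The scale factor sharpens the softmax: cosine similarities lie in [-1, 1], so without scaling the loss could not approach zero even for perfectly matched pairs.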

Parameters of the fit()-Method:

{
    "epochs": 5,
    "evaluation_steps": 50,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
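The learning-rate schedule implied by these parameters can be sketched as follows. This assumes that `WarmupLinear` ramps the rate linearly from 0 to `lr` over `warmup_steps` and then decays it linearly to 0, and that with `steps_per_epoch` unset the total is 200 batches × 5 epochs = 1000 steps. The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the rate peaks at 2e-05 after step 100 and is back at half that value by step 550.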

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
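The last two modules determine how a sentence embedding is formed from the transformer output: the pooling layer keeps only the first ([CLS]) token vector, and the normalization layer rescales it to unit length, so a dot product between two embeddings equals their cosine similarity. Below is a small NumPy sketch of those two steps, using random numbers as a stand-in for real BertModel outputs.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings):
    """Take the first-token vector of each sequence and L2-normalize it."""
    cls = token_embeddings[:, 0, :]  # (batch, hidden): CLS pooling
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

# Random stand-in for transformer output: 2 sequences, 16 tokens, 1024 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 1024))

emb = cls_pool_and_normalize(tokens)
print(emb.shape)  # (2, 1024)
print(np.linalg.norm(emb, axis=1))
```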

Citing & Authors

Based on BAAI/bge-large-en-v1.5 (https://huggingface.co/BAAI/bge-large-en-v1.5) by Xiao et al. (2023), "C-Pack: Packaged Resources To Advance General Chinese Embedding".

Model size: 335M params (Safetensors), tensor type F32