File size: 2,847 Bytes
d964534
6b129b8
 
 
 
 
 
5aec005
 
d964534
6b129b8
78a4be2
6b129b8
5aec005
6b129b8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78a4be2
6b129b8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5aec005
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
license: mit
---

# bge-large-en-v1.5-ISO-27001

This is a fine-tuned embedding model of [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5). It was fine-tuned on a dataset based on an ISO 27001 text corpus consisting of text chunks (1024 characters) and associated questions. A total of 2.000 chunk and question pairs were generated. The fine-tuning process is specialized on an Information Retrieval task in which the generated questions are used to find the relevant chunks. The effectiveness of the model is evaluated on whether the correct chunk was retrieved, and the loss is calculated with the multiple negative ranking loss.

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bge-large-en-v1.5-ISO-27001')
embeddings = model.encode(sentences)
print(embeddings)
```

## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 200 with parameters:
```
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
  ```
  {'scale': 20.0, 'similarity_fct': 'cos_sim'}
  ```

Parameters of the fit()-Method:
```
{
    "epochs": 5,
    "evaluation_steps": 50,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
```


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Citing & Authors
Based on https://huggingface.co/BAAI/bge-large-en-v1.5 from Xiao et al. (2023) (C-Pack: Packaged Resources To Advance General Chinese Embedding)