---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers

---

# GitHub Issues MPNet Sentence Transformer (10 Epochs)

This is a [sentence-transformers](https://www.SBERT.net) model specialized for GitHub issue data.

## Dataset

For training, we used the [NLBSE22 dataset](https://nlbse2022.github.io/tools/), after removing duplicates and issues with an empty body.
The sentence embedding model was trained on the similarity between each issue's title and its body.
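
The filtering step described above could look roughly like the following. This is a minimal sketch, not the authors' preprocessing code: the `title` and `body` field names, the `build_pairs` helper, and the use of exact-match deduplication are all assumptions made for illustration.

```python
def build_pairs(issues):
    """Return unique (title, body) pairs, skipping issues with an empty body.

    Assumes each issue is a dict with "title" and "body" keys (hypothetical
    field names) and treats only exact title/body matches as duplicates.
    """
    seen = set()
    pairs = []
    for issue in issues:
        title = (issue.get("title") or "").strip()
        body = (issue.get("body") or "").strip()
        if not body:
            continue  # drop issues with an empty body
        key = (title, body)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        pairs.append(key)
    return pairs


# Toy data, invented for this example.
issues = [
    {"title": "Crash on startup", "body": "The app crashes when launched."},
    {"title": "Crash on startup", "body": "The app crashes when launched."},
    {"title": "Add dark mode", "body": ""},
]
pairs = build_pairs(issues)
print(pairs)
```

Each surviving pair then serves as one positive training example, with the title and the body as the two texts to be pulled together.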


## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
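from sentence_transformers import SentenceTransformer

# Example inputs; for this model these would typically be issue titles or bodies.
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub (or from the local cache).
model = SentenceTransformer('Collab-uniba/github-issues-mpnet-st-e10')

# encode() returns one embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings)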
```
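
Because the model is trained so that a title and its body end up close together, a natural way to compare two embeddings is cosine similarity. The sketch below uses small hand-written vectors instead of real model output so that it runs without downloading anything; `cosine_similarity` is an illustrative helper, not part of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for the model's embeddings.
title_vec = [1.0, 0.0, 1.0]
body_vec = [1.0, 0.0, 1.0]
other_vec = [0.0, 1.0, 0.0]

same = cosine_similarity(title_vec, body_vec)        # close to 1
different = cosine_similarity(title_vec, other_vec)  # orthogonal vectors
print(same, different)
```

With real output, the same function can be applied to two rows of `embeddings`; `sentence_transformers.util.cos_sim` provides an equivalent batched computation.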


## Training
The model was trained for ten epochs using Multiple Negatives Ranking Loss, treating the title and body of the same issue as a positive (similar) pair.
We used the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 39221 with parameters:
```
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
  ```
  {'scale': 20.0, 'similarity_fct': 'cos_sim'}
  ```
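
To make the loss concrete, here is a minimal pure-Python sketch of the idea behind it: within a batch, each title is scored against every body using scaled cosine similarity, and a cross-entropy loss rewards the matching body. It is an illustration under simplifying assumptions (plain lists instead of tensors, no gradients), not the library's implementation; `mnr_loss` and `cos_sim` here are hypothetical helper names.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    All other positives in the batch act as negatives for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" for a batch of two issues.
titles = [[1.0, 0.0], [0.0, 1.0]]
bodies_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each body matches its title
bodies_swapped = [[0.0, 1.0], [1.0, 0.0]]   # bodies paired with the wrong title

aligned_loss = mnr_loss(titles, bodies_aligned)
swapped_loss = mnr_loss(titles, bodies_swapped)
print(aligned_loss, swapped_loss)
```

The aligned batch yields a loss near zero, while the swapped batch yields a large one, which is the signal that pushes matching titles and bodies together during training.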

Parameters of the `fit()` method:
```
{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 39221,
    "weight_decay": 0.01
}
```
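
The `WarmupLinear` scheduler raises the learning rate linearly from zero during warmup and then decays it linearly back to zero. The sketch below reproduces that shape with the values listed above. It assumes one optimizer step per batch, so the total is taken as 39221 steps per epoch times 10 epochs; `warmup_linear_lr` is a hypothetical helper, not a library function.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=39221, total_steps=39221 * 10):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 39221, 392210):
    print(step, warmup_linear_lr(step))
```

Under that assumption, `warmup_steps` equals the DataLoader length, so the learning rate peaks at the end of the first epoch and decays over the remaining nine.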


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
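
The pooling layer above has only `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings, with padding positions ignored. A minimal sketch of that operation on plain lists follows; `mean_pool` is an illustrative helper, and the toy vectors are 2-dimensional rather than the model's 768 dimensions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                sums[j] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [s / count for s in sums]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```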

## Citing & Authors

<!--- Describe where people can find more information -->