---
license: mit
language:
- en
library_name: transformers
---
# BERT Model for Software Engineering 

This repository was created as part of a computer engineering undergraduate graduation project.
The research is an exploratory case study that aims to determine the functional dimensions of user requirements or use cases for software projects.
To perform this task, we created two models: SE-BERT and [SE-BERTurk](https://huggingface.co/burakkececi/bert-turkish-software-engineering).

# SE-BERT

SE-BERT is a BERT model trained for domain adaptation in a software engineering context.

We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM enhances the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked words from the surrounding context.
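
The sketch below illustrates this masking step with the Hugging Face `DataCollatorForLanguageModeling`. The `bert-base-uncased` checkpoint and the 15% masking ratio are assumptions for illustration, not details stated in this card.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Assumed base checkpoint; the card does not state which BERT variant SE-BERT starts from.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Randomly replace a fraction of tokens with [MASK]; 0.15 is the standard BERT
# ratio and is an assumption here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("The system shall log every failed authentication attempt.",
                     return_tensors="pt")
batch = collator([{k: v[0] for k, v in encoding.items()}])

# batch["input_ids"] now contains [MASK] at randomly chosen positions, while
# batch["labels"] keeps the original ids only at those positions (-100 elsewhere).
print(tokenizer.decode(batch["input_ids"][0]))
```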

## Stats
Created a bilingual [SE corpus](https://drive.google.com/file/d/1IgnJTaR2-pe889TdQZtYF8SKOH92mi1l/view?usp=drive_link) (166 MB) ➡️ [Descriptive stats of the corpus](https://docs.google.com/spreadsheets/d/1Xnn_xfu4tdCtWg-nQ8ce_LHe9F-g0BSmUxzTdi5g1r4/edit?usp=sharing)
 * 166K entries = 886K sentences = 10M words
 * 156K training entries + 10K test entries
 * Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and contains 10,554,750 words.
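
The card does not describe how raw documents were split into 512-token entries; the following is only a minimal sketch of one possible chunking approach using the tokenizer's overflow mechanism (the `bert-base-uncased` tokenizer and the example document are assumptions).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed base checkpoint

def split_into_entries(document: str, max_length: int = 512):
    """Split one document into consecutive chunks of at most `max_length` tokens."""
    encoded = tokenizer(
        document,
        truncation=True,
        max_length=max_length,
        return_overflowing_tokens=True,  # emit the remainder as additional chunks
    )
    return encoded["input_ids"]  # list of token-id chunks, each <= 512 tokens long

chunks = split_into_entries("Use case: the operator schedules a nightly backup. " * 200)
print(len(chunks), [len(c) for c in chunks])
```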

## MLM Training (Domain Adaptation)
We used the ``AdamW`` optimizer and set ``num_epochs = 1``, ``lr = 2e-5``, ``eps = 1e-8`` (see the training sketch after this list):
  * For a T4 GPU ➡️ set ``batch_size = 6`` (13.5 GB of GPU memory)
  * For an A100 GPU ➡️ set ``batch_size = 50`` (37 GB of GPU memory) and ``fp16 = True``
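
A minimal sketch of such an MLM run with the Hugging Face `Trainer`, assuming a `bert-base-uncased` starting checkpoint and a plain-text corpus file (`corpus.txt` is a placeholder); this is illustrative, not the project's exact training script.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed base checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load and tokenize the domain corpus; "corpus.txt" is a placeholder path.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="se-bert-mlm",
    num_train_epochs=1,               # num_epochs = 1
    learning_rate=2e-5,               # lr = 2e-5
    adam_epsilon=1e-8,                # eps = 1e-8 (Trainer uses AdamW by default)
    per_device_train_batch_size=50,   # A100 setting; use 6 on a T4
    fp16=True,                        # A100 setting
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```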

**Perplexity**
 * ``6.673`` PPL for SE-BERT

### Evaluation Steps
1) Calculate ``PPL`` (perplexity) on the test corpus (10K entries with a maximum length of 512 tokens)
2) Calculate ``PPL`` (perplexity) on the requirement datasets
3) Evaluate performance on downstream tasks (see the metrics sketch after this list):
  * For size measurement ➡️ ``MAE``, ``MSE``, ``MMRE``, ``PRED(30)``, ``ACC``
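
For reference, perplexity here is the exponential of the average masked-LM loss over the held-out entries, and MMRE / PRED(30) follow their standard definitions from size/effort estimation. The loop and function names below are an illustrative sketch, not the project's evaluation code.

```python
import math
import torch
from torch.utils.data import DataLoader

def perplexity(model, eval_dataloader: DataLoader, device: str = "cuda") -> float:
    """exp of the average masked-LM loss over the evaluation set."""
    model.eval()
    losses = []
    with torch.no_grad():
        for batch in eval_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            losses.append(model(**batch).loss.item())
    return math.exp(sum(losses) / len(losses))

def mmre(actual, predicted) -> float:
    """Mean Magnitude of Relative Error: mean(|a - p| / a)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def pred(actual, predicted, threshold: float = 0.30) -> float:
    """PRED(30): fraction of predictions with relative error <= 30%."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= threshold)
    return hits / len(actual)
```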

## Usage

With Transformers >= 2.11, our uncased SE-BERT model can be loaded like this:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModel.from_pretrained("burakkececi/bert-software-engineering")
```
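
If the masked-LM head is included in the uploaded checkpoint, the model can also be exercised through the `fill-mask` pipeline as a quick sanity check (the example sentence is illustrative):

```python
from transformers import pipeline

# Requires the MLM head weights to be present in the checkpoint.
fill_mask = pipeline("fill-mask", model="burakkececi/bert-software-engineering")
print(fill_mask("The system shall [MASK] the user before granting access."))
```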

# Hugging Face Model Hub

All models are available on the [Hugging Face model hub](https://huggingface.co/burakkececi).