---
license: mit
language:
- en
library_name: transformers
---

# BERT Model for Software Engineering

This repository was created as part of a computer engineering undergraduate graduation project.

This research aims to perform an exploratory case study to determine the functional dimensions of user requirements or use cases for software projects.

To perform this task, we created two models: SE-BERT and [SE-BERTurk](https://huggingface.co/burakkececi/bert-turkish-software-engineering).

# SE-BERT

SE-BERT is a BERT model trained for domain adaptation in a software engineering context.

We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM improves the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked words from the surrounding context.
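
The snippet below is a minimal, illustrative sketch of this masking step using the Hugging Face `DataCollatorForLanguageModeling`; the example sentence and the 15% masking probability are assumptions, not values reported for SE-BERT.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a placeholder SE-domain sentence.
examples = [tokenizer("The system shall allow users to reset their passwords.",
                      truncation=True, max_length=512)]

# The collator randomly masks tokens; the model is trained to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(examples)

print(batch["input_ids"][0])  # input ids with some positions replaced by [MASK]
print(batch["labels"][0])     # original ids at masked positions, -100 elsewhere
```
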
## Stats

We created a bilingual [SE corpus](https://drive.google.com/file/d/1IgnJTaR2-pe889TdQZtYF8SKOH92mi1l/view?usp=drive_link) (166 MB) ➡️ [Descriptive stats of the corpus](https://docs.google.com/spreadsheets/d/1Xnn_xfu4tdCtWg-nQ8ce_LHe9F-g0BSmUxzTdi5g1r4/edit?usp=sharing)

* 166K entries = 886K sentences = 10M words
* 156K training entries + 10K test entries
* Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and contains 10,554,750 words. The sketch below shows one way such 512-token entries could be produced from raw documents.
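
This is a minimal sketch under stated assumptions: the chunking strategy, the `bert-base-uncased` tokenizer, and the placeholder text are illustrative and not necessarily the exact preprocessing used to build the SE corpus.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_into_entries(text, max_tokens=510):
    """Split a document into chunks of at most max_tokens sub-word tokens
    (510 leaves room for the [CLS] and [SEP] special tokens)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

entries = chunk_into_entries("The system shall validate all user input. " * 500)  # placeholder document
print(len(entries), "entries of at most 510 tokens each")
```
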
## MLM Training (Domain Adaptation)

We used the ``AdamW`` optimizer and set ``num_epochs = 1``, ``lr = 2e-5``, ``eps = 1e-8``. A minimal training sketch follows the GPU notes below.

* For a T4 GPU ➡️ set ``batch_size = 6`` (13.5 GB memory)
* For an A100 GPU ➡️ set ``batch_size = 50`` (37 GB memory) and ``fp16 = True``
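
The loop below is a hedged sketch of this domain-adaptation run, not the exact training script: the base checkpoint (`bert-base-uncased`), the placeholder corpus, and the use of `torch.cuda.amp` for the ``fp16 = True`` setting are assumptions; the optimizer settings, epoch count, and batch sizes mirror the values listed above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").cuda()

# Placeholder corpus entries; in practice these are the 156K training entries.
texts = ["The system shall allow users to reset their passwords."]
train_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
loader = DataLoader(train_dataset, batch_size=6, shuffle=True, collate_fn=collator)  # 6 on a T4, 50 on an A100

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
scaler = torch.cuda.amp.GradScaler()  # mixed precision, as with fp16 = True on the A100

model.train()
for epoch in range(1):  # num_epochs = 1
    for batch in loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```
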
**Perplexity**

* ``6,673`` PPL for SE-BERT

### Evaluation Steps

1) Calculate ``PPL`` (perplexity) on the test corpus (10K entries with a maximum length of 512 tokens); see the sketch after this list
2) Calculate ``PPL`` (perplexity) on the requirement datasets
3) Evaluate performance on downstream tasks:
   * For size measurement ➡️ ``MAE``, ``MSE``, ``MMRE``, ``PRED(30)``, ``ACC``
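
As a reference for step 1, perplexity can be computed as the exponential of the average masked-LM loss over the test entries. The sketch below assumes the published checkpoint includes the MLM head and uses a placeholder test set; it is illustrative, not the exact evaluation script.

```python
import math
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModelForMaskedLM.from_pretrained("burakkececi/bert-software-engineering").eval()

# Placeholder test entries; in practice these are the 10K held-out entries.
test_dataset = [tokenizer(t, truncation=True, max_length=512)
                for t in ["The system shall log every failed login attempt."]]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
loader = DataLoader(test_dataset, batch_size=8, collate_fn=collator)

losses = []
with torch.no_grad():
    for batch in loader:
        losses.append(model(**batch).loss.item())

print("PPL:", math.exp(sum(losses) / len(losses)))
```
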
## Usage

With Transformers >= 2.11, our SE-BERT uncased model can be loaded like this:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModel.from_pretrained("burakkececi/bert-software-engineering")
```
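
If the masked-language-modeling head was saved with the checkpoint, the domain-adapted model can also be tried through the ``fill-mask`` pipeline; the prompt below is only an illustration.

```python
from transformers import pipeline

# Assumes the checkpoint includes the MLM head; the sentence is a made-up example.
fill = pipeline("fill-mask", model="burakkececi/bert-software-engineering")
print(fill("The system shall [MASK] user credentials before granting access."))
```
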
# Hugging Face Model Hub

All models are available on the [Hugging Face model hub](https://huggingface.co/burakkececi).