---
license: mit
language:
- en
library_name: transformers
---
# BERT Model for Software Engineering
This repository was created as part of a computer engineering undergraduate graduation project. The research conducts an exploratory case study to determine the functional dimensions of user requirements or use cases for software projects. To perform this task, we created two models: SE-BERT and SE-BERTurk.
## SE-BERT
SE-BERT is a BERT model trained for domain adaptation in a software engineering context.
We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM improves the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked tokens from the surrounding context.
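To illustrate how MLM masking works in practice, here is a minimal sketch using `DataCollatorForLanguageModeling` from the `transformers` library; the 15% masking probability, the `bert-base-uncased` tokenizer, and the example sentence are illustrative assumptions, not values taken from this project.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any BERT-style tokenizer works for the illustration; bert-base-uncased is an assumption here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Randomly replaces 15% of the tokens with [MASK] (or a random/unchanged token);
# the model is then trained to recover the originals.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("The system shall authenticate users before granting access.", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

# `labels` keeps the original ids at masked positions and -100 elsewhere (ignored by the loss).
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```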
### Stats
Created a bilingual SE corpus (166 MB) ➡️ Descriptive stats of the corpus:
- 166K entries = 886K sentences = 10M words
- 156K training entries + 10K test entries
- Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and contains 10,554,750 words.
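The following is a minimal sketch of how such entries could be prepared with the `datasets` library; the corpus file name, the base tokenizer, and the exact split call are assumptions for illustration (the card only states the 512-token limit and the 156K/10K split).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical corpus file; the actual corpus layout is not published in this card.
raw = load_dataset("text", data_files={"train": "se_corpus.txt"})["train"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    # Truncate every entry to the 512-token limit mentioned above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

encoded = raw.map(encode, batched=True, remove_columns=["text"])

# Hold out 10K entries for the test split (156K training + 10K test entries).
splits = encoded.train_test_split(test_size=10_000, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```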
### MLM Training (Domain Adaptation)
- Used the `AdamW` optimizer and set `num_epochs = 1`, `lr = 2e-5`, `eps = 1e-8`
- For T4 GPU ➡️ Set `batch_size = 6` (13.5 GB memory)
- For A100 GPU ➡️ Set `batch_size = 50` (37 GB memory) and `fp16 = True`
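A minimal training sketch with these settings, using the `Trainer` API (whose default optimizer is AdamW) and reusing `train_ds` from the corpus sketch above; the base checkpoint and output directory are assumptions, not the project's actual training script.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Base checkpoint is an assumption; the card does not state which BERT variant was adapted.
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="se-bert-mlm",          # hypothetical output path
    num_train_epochs=1,                # num_epochs = 1
    learning_rate=2e-5,                # lr = 2e-5
    adam_epsilon=1e-8,                 # eps = 1e-8 (Trainer uses AdamW by default)
    per_device_train_batch_size=6,     # 6 on a T4; 50 on an A100
    fp16=False,                        # set True on an A100, as noted above
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_ds)
trainer.train()
```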
### Perplexity

PPL for SE-BERT: 6,673
Evaluation Steps:
- Calculate `PPL` (perplexity) on the test corpus (10K entries with a maximum length of 512 tokens); a sketch follows this list
- Calculate `PPL` (perplexity) on the requirement datasets
- Evaluate performance on downstream tasks:
  - For size measurement ➡️ `MAE`, `MSE`, `MMRE`, `PRED(30)`, `ACC`
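As a sketch of the first step, pseudo-perplexity for a masked language model can be computed by exponentiating the average MLM loss over the test split; this reuses `model`, `collator`, and `test_ds` from the sketches above, and the evaluation batch size is an arbitrary assumption.

```python
import math
import torch
from torch.utils.data import DataLoader

model.eval()
loader = DataLoader(test_ds, batch_size=8, collate_fn=collator)  # batch size chosen arbitrarily

total_loss, n_batches = 0.0, 0
with torch.no_grad():
    for batch in loader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)          # the collator supplies masked inputs and labels
        total_loss += outputs.loss.item()
        n_batches += 1

# Perplexity is the exponential of the mean masked-LM loss.
ppl = math.exp(total_loss / n_batches)
print(f"PPL = {ppl:.3f}")
```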
## Usage
With Transformers >= 2.11, our SE-BERT uncased model can be loaded like this:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModel.from_pretrained("burakkececi/bert-software-engineering")
```
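As a quick sanity check of the domain adaptation, the model can also be queried through the fill-mask pipeline; this is a sketch that assumes the repository exposes a masked-LM head, and the example sentence is made up.

```python
from transformers import pipeline

# Assumes the repo contains a masked-LM head; otherwise load AutoModelForMaskedLM explicitly.
fill_mask = pipeline("fill-mask", model="burakkececi/bert-software-engineering")

for prediction in fill_mask("The system shall [MASK] the user credentials."):
    print(prediction["token_str"], round(prediction["score"], 3))
```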
## Hugging Face model hub

All models are available on the Hugging Face model hub.