---
license: mit
language:
- en
library_name: transformers
---

# BERT Model for Software Engineering

This repository was created as part of a computer engineering undergraduate graduation project.

This research aims to perform an exploratory case study to determine the functional dimensions of user requirements or use cases for software projects.

To perform this task, we created two models: SE-BERT and [SE-BERTurk](https://huggingface.co/burakkececi/bert-turkish-software-engineering).

# SE-BERT

SE-BERT is a BERT model trained for domain adaptation in a software engineering context.

We applied Masked Language Modeling (MLM), an unsupervised learning technique, for domain adaptation. MLM improves the model's understanding of domain-specific language by masking portions of the input text and training the model to predict the masked words from the surrounding context.
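
The snippet below is a minimal, illustrative sketch of this masking step using the Hugging Face `DataCollatorForLanguageModeling`; the example sentence and the 15% masking probability are assumptions, not values reported for SE-BERT.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a placeholder SE-domain sentence.
examples = [tokenizer("The system shall allow users to reset their passwords.",
                      truncation=True, max_length=512)]

# The collator randomly masks tokens; the model is trained to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(examples)

print(batch["input_ids"][0])  # input ids with some positions replaced by [MASK]
print(batch["labels"][0])     # original ids at masked positions, -100 elsewhere
```
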
## Stats

We created a bilingual [SE corpus](https://drive.google.com/file/d/1IgnJTaR2-pe889TdQZtYF8SKOH92mi1l/view?usp=drive_link) (166 MB) ➡️ [Descriptive stats of the corpus](https://docs.google.com/spreadsheets/d/1Xnn_xfu4tdCtWg-nQ8ce_LHe9F-g0BSmUxzTdi5g1r4/edit?usp=sharing)

* 166K entries = 886K sentences = 10M words
* 156K training entries + 10K test entries
* Each entry has a maximum length of 512 tokens

The final training corpus has a size of 166 MB and contains 10,554,750 words. The sketch below shows one way such 512-token entries could be produced from raw documents.
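
This is a minimal sketch under stated assumptions: the chunking strategy, the `bert-base-uncased` tokenizer, and the placeholder text are illustrative and not necessarily the exact preprocessing used to build the SE corpus.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_into_entries(text, max_tokens=510):
    """Split a document into chunks of at most max_tokens sub-word tokens
    (510 leaves room for the [CLS] and [SEP] special tokens)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

entries = chunk_into_entries("The system shall validate all user input. " * 500)  # placeholder document
print(len(entries), "entries of at most 510 tokens each")
```
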
## MLM Training (Domain Adaptation)

We used the ``AdamW`` optimizer and set ``num_epochs = 1``, ``lr = 2e-5``, ``eps = 1e-8``. A minimal training sketch follows the GPU notes below.

* For a T4 GPU ➡️ set ``batch_size = 6`` (13.5 GB memory)
* For an A100 GPU ➡️ set ``batch_size = 50`` (37 GB memory) and ``fp16 = True``
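
The loop below is a hedged sketch of this domain-adaptation run, not the exact training script: the base checkpoint (`bert-base-uncased`), the placeholder corpus, and the use of `torch.cuda.amp` for the ``fp16 = True`` setting are assumptions; the optimizer settings, epoch count, and batch sizes mirror the values listed above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").cuda()

# Placeholder corpus entries; in practice these are the 156K training entries.
texts = ["The system shall allow users to reset their passwords."]
train_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
loader = DataLoader(train_dataset, batch_size=6, shuffle=True, collate_fn=collator)  # 6 on a T4, 50 on an A100

optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
scaler = torch.cuda.amp.GradScaler()  # mixed precision, as with fp16 = True on the A100

model.train()
for epoch in range(1):  # num_epochs = 1
    for batch in loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```
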
**Perplexity**

* ``6,673`` PPL for SE-BERT

### Evaluation Steps

1) Calculate ``PPL`` (perplexity) on the test corpus (10K entries with a maximum length of 512 tokens); see the sketch after this list
2) Calculate ``PPL`` (perplexity) on the requirement datasets
3) Evaluate performance on downstream tasks:
   * For size measurement ➡️ ``MAE``, ``MSE``, ``MMRE``, ``PRED(30)``, ``ACC``
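
As a reference for step 1, perplexity can be computed as the exponential of the average masked-LM loss over the test entries. The sketch below assumes the published checkpoint includes the MLM head and uses a placeholder test set; it is illustrative, not the exact evaluation script.

```python
import math
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModelForMaskedLM.from_pretrained("burakkececi/bert-software-engineering").eval()

# Placeholder test entries; in practice these are the 10K held-out entries.
test_dataset = [tokenizer(t, truncation=True, max_length=512)
                for t in ["The system shall log every failed login attempt."]]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
loader = DataLoader(test_dataset, batch_size=8, collate_fn=collator)

losses = []
with torch.no_grad():
    for batch in loader:
        losses.append(model(**batch).loss.item())

print("PPL:", math.exp(sum(losses) / len(losses)))
```
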
## Usage

With Transformers >= 2.11, our SE-BERT uncased model can be loaded like this:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("burakkececi/bert-software-engineering")
model = AutoModel.from_pretrained("burakkececi/bert-software-engineering")
```
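
If the masked-language-modeling head was saved with the checkpoint, the domain-adapted model can also be tried through the ``fill-mask`` pipeline; the prompt below is only an illustration.

```python
from transformers import pipeline

# Assumes the checkpoint includes the MLM head; the sentence is a made-up example.
fill = pipeline("fill-mask", model="burakkececi/bert-software-engineering")
print(fill("The system shall [MASK] user credentials before granting access."))
```
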
# Hugging Face Model Hub

All models are available on the [Hugging Face model hub](https://huggingface.co/burakkececi).