|
--- |
|
language: |
|
- en |
|
--- |
|
# SOBertLarge |
|
|
|
## Model Description |
|
|
|
SOBertLarge is a 762M parameter BERT model trained on 27 billion tokens of StackOverflow answer and comment text using the Megatron Toolkit.
|
|
|
SOBert is pre-trained on 19 GB of data presented as 15 million samples, where each sample contains an entire post and all of its corresponding comments. We also include
|
all code in each answer so that our model is bimodal in nature. We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as “unknown”.
|
Additionally, SOBert is trained with a maximum sequence length of 2048, based on the empirical length distribution of StackOverflow posts, and a relatively
|
large batch size of 0.5M tokens. A smaller 109 million parameter model can also be found [here](https://huggingface.co/mmukh/SOBertBase). More details can be found in the paper
|
[Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268). |
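
Because the tokenizer is a SentencePiece model trained with Byte-Pair Encoding, code-heavy text is split into subword pieces rather than mapped to an unknown token. The snippet below is a minimal sketch of this behavior; the example string is made up, and the checkpoint name is the one used throughout this card.

```python
from transformers import PreTrainedTokenizerFast

# Load the SOBertLarge tokenizer from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")

# A made-up StackOverflow-style answer that mixes prose and code.
text = "You can reverse a list in place with `lst.reverse()` or copy it reversed with `lst[::-1]`."

# Tokenize, truncating to the model's 2048-token maximum sequence length.
encoding = tokenizer(text, truncation=True, max_length=2048)
print(len(encoding["input_ids"]))                                   # number of subword tokens
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"])[:10])  # first few subword pieces
```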
|
|
|
### How to use
|
|
|
```python |
|
from transformers import MegatronBertModel, PreTrainedTokenizerFast

# Load the SOBertLarge encoder and its tokenizer from the Hugging Face Hub.
model = MegatronBertModel.from_pretrained("mmukh/SOBertLarge")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")
|
|
|
``` |
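
Building on the loading code above, the sketch below shows one way to turn a post into a single vector by running the encoder and mean-pooling its final hidden states. This is only an illustration: the example text and the mean-pooling choice are assumptions, not something prescribed by the paper.

```python
import torch
from transformers import MegatronBertModel, PreTrainedTokenizerFast

model = MegatronBertModel.from_pretrained("mmukh/SOBertLarge")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")
model.eval()

# Made-up StackOverflow-style input mixing prose and code.
text = (
    "Use a context manager so the file is closed automatically:\n\n"
    "with open('data.txt') as f:\n"
    "    data = f.read()"
)

# Encode up to the model's 2048-token maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one post-level vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```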
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@article{mukherjee2023stack, |
|
title={Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models}, |
|
author={Mukherjee, Manisha and Hellendoorn, Vincent J}, |
|
journal={arXiv preprint arXiv:2306.03268}, |
|
year={2023} |
|
} |
|
``` |