|
--- |
|
language: |
|
- en |
|
--- |
|
# SOBertLarge |
|
|
|
## Model Description |
|
|
|
SOBertLarge is a 762M parameter BERT model trained on 27 billion tokens of StackOverflow answer and comment text using the Megatron Toolkit.
|
|
|
SOBert is pre-trained on 19 GB of data presented as 15 million samples, where each sample contains an entire post and all of its corresponding comments. We also include
|
all code in each answer so that our model is bimodal in nature. We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as “unknown”.
|
Additionally, SOBert is trained with a maximum sequence length of 2048, based on the empirical length distribution of StackOverflow posts, and a relatively
|
large batch size of 0.5M tokens. A smaller 109 million parameter model can also be found [here](https://huggingface.co/mmukh/SOBertBase). More details can be found in the paper
|
[Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268). |
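
Because the tokenizer is a SentencePiece model trained with Byte-Pair Encoding, code-heavy text is split into subword pieces rather than mapped to an unknown token. The snippet below is a minimal sketch of this behavior; the example string is made up, and the checkpoint name is the one used throughout this card.

```python
from transformers import PreTrainedTokenizerFast

# Load the SOBertLarge tokenizer from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")

# A made-up StackOverflow-style answer that mixes prose and code.
text = "You can reverse a list in place with `lst.reverse()` or copy it reversed with `lst[::-1]`."

# Tokenize, truncating to the model's 2048-token maximum sequence length.
encoding = tokenizer(text, truncation=True, max_length=2048)
print(len(encoding["input_ids"]))                                   # number of subword tokens
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"])[:10])  # first few subword pieces
```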
|
|
|
### How to use
|
|
|
```python |
|
from transformers import MegatronBertModel, PreTrainedTokenizerFast

# Load the SOBertLarge encoder and its tokenizer from the Hugging Face Hub.
model = MegatronBertModel.from_pretrained("mmukh/SOBertLarge")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")
|
|
|
``` |
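
Building on the loading code above, the sketch below shows one way to turn a post into a single vector by running the encoder and mean-pooling its final hidden states. This is only an illustration: the example text and the mean-pooling choice are assumptions, not something prescribed by the paper.

```python
import torch
from transformers import MegatronBertModel, PreTrainedTokenizerFast

model = MegatronBertModel.from_pretrained("mmukh/SOBertLarge")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertLarge")
model.eval()

# Made-up StackOverflow-style input mixing prose and code.
text = (
    "Use a context manager so the file is closed automatically:\n\n"
    "with open('data.txt') as f:\n"
    "    data = f.read()"
)

# Encode up to the model's 2048-token maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one post-level vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```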
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@article{mukherjee2023stack, |
|
title={Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models}, |
|
author={Mukherjee, Manisha and Hellendoorn, Vincent J}, |
|
journal={arXiv preprint arXiv:2306.03268}, |
|
year={2023} |
|
} |
|
``` |