# StackOBERTflow-comments-small

StackOBERTflow is a RoBERTa model trained on StackOverflow comments.

A byte-level BPE tokenizer with dropout was used (built with the `tokenizers` package).
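
For illustration, a tokenizer of this kind could be trained roughly as follows; the corpus file, vocabulary size, and dropout rate below are assumptions, not the values used for this model:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical sketch: the file path and all hyperparameters are illustrative.
tokenizer = ByteLevelBPETokenizer(dropout=0.1)  # BPE dropout: merges are randomly skipped at encoding time
tokenizer.train(
    files=["stackoverflow_comments.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")
```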

The model is *small*: it has only 6 layers, and the maximum sequence length was restricted to 256 tokens.
The model was trained for 6 epochs on several GBs of comments from the StackOverflow corpus.
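
A RoBERTa configuration of this size can be sketched with `transformers`. The vocabulary size here is an assumption and unset fields keep the library defaults; only the layer count and sequence length come from the description above:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # assumption; must match the tokenizer
    num_hidden_layers=6,          # "small": 6 transformer layers
    max_position_embeddings=258,  # 256 tokens plus the 2 offset slots RoBERTa reserves
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```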

## Quick start: masked language modeling prediction

```python
from transformers import pipeline
from pprint import pprint

COMMENT = "You really should not do it this way, I would use <mask> instead."

fill_mask = pipeline(
    "fill-mask",
    model="giganticode/StackOBERTflow-comments-small-v1",
    tokenizer="giganticode/StackOBERTflow-comments-small-v1",
)

pprint(fill_mask(COMMENT))
# [{'score': 0.019997311756014824,
#   'sequence': '<s> You really should not do it this way, I would use jQuery instead.</s>',
#   'token': 1738},
#  {'score': 0.01693696901202202,
#   'sequence': '<s> You really should not do it this way, I would use arrays instead.</s>',
#   'token': 2844},
#  {'score': 0.013411642983555794,
#   'sequence': '<s> You really should not do it this way, I would use CSS instead.</s>',
#   'token': 2254},
#  {'score': 0.013224546797573566,
#   'sequence': '<s> You really should not do it this way, I would use it instead.</s>',
#   'token': 300},
#  {'score': 0.011984303593635559,
#   'sequence': '<s> You really should not do it this way, I would use classes instead.</s>',
#   'token': 1779}]
```
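
The checkpoint can also be loaded directly with the standard `transformers` auto classes; a minimal sketch of scoring the masked position by hand:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("giganticode/StackOBERTflow-comments-small-v1")
model = AutoModelForMaskedLM.from_pretrained("giganticode/StackOBERTflow-comments-small-v1")

inputs = tokenizer("I would use <mask> instead.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the 5 highest-scoring tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))  # byte-level BPE tokens carry a leading 'Ġ' for spaces
```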