Update README.md
README.md CHANGED
@@ -11,6 +11,8 @@ license: cc-by-4.0
 
 # BNC-BERT
 
+- Paper: [Trained on 100 million words and still in shape: BERT meets British National Corpus](https://arxiv.org/abs/2303.09859)
+- GitHub: [ltgoslo/ltg-bert](https://github.com/ltgoslo/ltg-bert)
 
 ## Example usage
 
@@ -24,3 +26,22 @@ from modeling_ltgbert import LtgBertForMaskedLM
 tokenizer = AutoTokenizer.from_pretrained("path/to/folder")
 bert = LtgBertForMaskedLM.from_pretrained("path/to/folder")
 ```
+
+## Please cite the following publication
+```bibtex
+@inproceedings{samuel-etal-2023-trained,
+    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
+    author = "Samuel, David and
+      Kutuzov, Andrey and
+      {\O}vrelid, Lilja and
+      Velldal, Erik",
+    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
+    month = may,
+    year = "2023",
+    address = "Dubrovnik, Croatia",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.findings-eacl.146",
+    pages = "1954--1974",
+    abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
+}
+```
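A minimal fill-mask sketch of how the snippet above might be used. It is an illustration under a few assumptions: the checkpoint files sit at the README's `path/to/folder` placeholder, the loaded tokenizer defines a `[MASK]` token, and `LtgBertForMaskedLM` returns a standard Hugging Face `MaskedLMOutput` with a `logits` field.

```python
# Hypothetical fill-mask example; assumes the standard masked-LM interface.
import torch
from transformers import AutoTokenizer
from modeling_ltgbert import LtgBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/folder")
bert = LtgBertForMaskedLM.from_pretrained("path/to/folder")
bert.eval()

# Build a sentence with a single masked position.
text = f"The capital of Norway is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Assumes the model returns MaskedLMOutput-style output with `.logits`.
    logits = bert(input_ids=inputs["input_ids"]).logits

# Most likely token at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```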