davda54 committed on
Commit
5137c8d
1 Parent(s): 036e551

Update README.md

Files changed (1): README.md (+21 −0)
# BNC-BERT

- Paper: [Trained on 100 million words and still in shape: BERT meets British National Corpus](https://arxiv.org/abs/2303.09859)
- GitHub: [ltgoslo/ltg-bert](https://github.com/ltgoslo/ltg-bert)

## Example usage

```python
from transformers import AutoTokenizer

from modeling_ltgbert import LtgBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/folder")
bert = LtgBertForMaskedLM.from_pretrained("path/to/folder")
```
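
The snippet above only loads the tokenizer and the masked-LM model. As a minimal sketch of the step that usually follows (not part of the original README): predictions for a masked position are read off that position's row of the model's output logits, e.g. by taking the top-k scoring vocabulary indices. The stand-in scores below are a plain Python list rather than a real tensor; with the real model you would apply the same selection to the `.logits` returned by `bert`.

```python
# Illustrative stand-in only: selecting the top-k vocabulary indices from
# one row of masked-LM logits (a real model returns tensors, not lists).

def top_k_tokens(logits_row, k=3):
    """Return indices of the k highest-scoring vocabulary entries."""
    return sorted(range(len(logits_row)), key=lambda i: logits_row[i], reverse=True)[:k]

# stand-in scores for a single [MASK] position over a toy 6-entry vocabulary
mask_logits = [0.1, 2.5, 0.3, 1.7, 0.0, 0.9]
print(top_k_tokens(mask_logits))  # prints [1, 3, 5]
```

The returned indices would then be mapped back to strings with the tokenizer (e.g. `tokenizer.convert_ids_to_tokens`).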

## Please cite the following publication

```bibtex
@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    pages = "1954--1974",
    abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}
```