---
language:
- en
inference: false
tags:
- BERT
- BNC-BERT
- encoder
license: cc-by-4.0
---

# BNC-BERT

- Paper: [Trained on 100 million words and still in shape: BERT meets British National Corpus](https://arxiv.org/abs/2303.09859)
- GitHub: [ltgoslo/ltg-bert](https://github.com/ltgoslo/ltg-bert)

## Example usage

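This model currently needs the custom wrapper class defined in `modeling_ltgbert.py`. Once that file is importable (for example, placed next to your script), you can load the tokenizer and the model like this: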

```python
import torch
from transformers import AutoTokenizer
from modeling_ltgbert import LtgBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/folder")
bert = LtgBertForMaskedLM.from_pretrained("path/to/folder")
```
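
Once the model is loaded, a typical next step is to fill in a masked token. The helper below is a minimal sketch of that step, not part of the released code: `top_k_at_mask` is a hypothetical name, and the commented usage assumes that the wrapper returns an output object with a `.logits` field, as the standard `transformers` masked-LM heads do, and that the tokenizer defines `mask_token` and `mask_token_id`.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Return the k highest-scoring token ids for every masked position.

    logits:    tensor of shape (seq_len, vocab_size)
    input_ids: tensor of shape (seq_len,)
    """
    # Indices of the positions that hold the mask token.
    mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=True)[0]
    # For each masked position, take the k largest logits (best first).
    top_ids = logits[mask_positions].topk(k, dim=-1).indices
    return top_ids.tolist()


# Hypothetical usage with the `tokenizer` and `bert` objects loaded above
# (assumes the forward pass returns an object with a `.logits` attribute):
#
# text = f"The capital of France is {tokenizer.mask_token}."
# inputs = tokenizer(text, return_tensors="pt")
# with torch.no_grad():
#     logits = bert(**inputs).logits
# ids = top_k_at_mask(logits[0], inputs["input_ids"][0], tokenizer.mask_token_id)
# print([tokenizer.convert_ids_to_tokens(row) for row in ids])
```

Keeping the selection logic in a small function that takes plain tensors makes it easy to check on dummy data before running the full model.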

## Please cite the following publication
```bibtex
@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    pages = "1954--1974",
    abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}
```