## BabyBERTa

### Overview

BabyBERTa is a light-weight version of RoBERTa trained on 5M words of American-English child-directed input.
It is intended for language acquisition research and runs on a single desktop with a single GPU; no high-performance computing infrastructure is needed.

### Loading the tokenizer

BabyBERTa was trained with `add_prefix_space=True`, so it will not work properly with the tokenizer defaults.
Make sure to load the tokenizer as follows:

```python
from transformers import RobertaTokenizerFast

# add_prefix_space=True matches the tokenization used during BabyBERTa training
tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
                                                 add_prefix_space=True)
```
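As a quick sanity check, the tokenizer can be paired with the model's masked-language-modeling head for fill-mask prediction. The snippet below is only an illustrative sketch: it assumes the model loads with the standard `RobertaForMaskedLM` class and uses the generic `transformers` fill-mask pipeline, and the example sentence is made up.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load BabyBERTa together with the tokenizer configured as shown above.
tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
                                                 add_prefix_space=True)
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa")

# Predict the masked word; input is lower-cased because BabyBERTa is not case-sensitive.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("the boy is reading a <mask> ."):
    print(prediction["token_str"], round(prediction["score"], 3))
```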
### Performance
The provided model is the best-performing out of 10 that were evaluated on the [Zorro](https://github.com/phueb/Zorro) test suite.
This model was trained for 400K steps and achieves an overall accuracy of 80.3,
comparable to RoBERTa-base, which achieves an overall accuracy of 82.6 on the latest version of Zorro (as of October 2021).

Both values differ slightly from those reported in the paper (Huebner et al., 2020).
There are two reasons for this:
1. The performance of RoBERTa-base is slightly higher here because the authors previously lower-cased all words in Zorro before evaluation.
Lower-casing proper nouns is detrimental to RoBERTa-base because it was likely trained on proper nouns that are primarily title-cased.
In contrast, because BabyBERTa is not case-sensitive, its performance is not influenced by this change.
2. The latest version of Zorro no longer contains ambiguous content words such as "Spanish", which can be both a noun and an adjective.
This resulted in a small reduction in the performance of BabyBERTa.
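Zorro provides its own sentences and scoring code in the repository linked above. Purely to illustrate the kind of minimal-pair comparison such a test suite involves, the sketch below scores a grammatical and an ungrammatical sentence by pseudo-log-likelihood (masking one token at a time); this is a simplified, hypothetical probe, not the procedure used to compute the accuracies reported here, and the sentence pair is invented.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
                                                 add_prefix_space=True)
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# A model with the relevant grammatical knowledge should prefer the first sentence.
grammatical = "the dogs in the house are sleeping ."
ungrammatical = "the dogs in the house is sleeping ."
print(pseudo_log_likelihood(grammatical) > pseudo_log_likelihood(ungrammatical))
```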
### Additional Information
This model was trained by [Philip Huebner](https://philhuebner.com), currently at the [UIUC Language and Learning Lab](http://www.learninglanguagelab.org).

More info can be found [here](https://github.com/phueb/BabyBERTa).