add info about loading tokenizer
README.md
BabyBERTa is a lightweight version of RoBERTa trained on 5M words of American-English child-directed input.
It is intended for language acquisition research and runs on a single desktop with a single GPU; no high-performance computing infrastructure is needed.

## Loading the tokenizer

BabyBERTa was trained with `add_prefix_space=False`, so it will not work properly with the tokenizer defaults.
Make sure to load the tokenizer as follows:

```python
from transformers import RobertaTokenizerFast

# add_prefix_space must match the setting used during training
tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
                                                 add_prefix_space=False)
```
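As a quick sanity check that the tokenizer loaded as intended, you can encode a sentence and decode it back (a minimal sketch; the example sentence is arbitrary):

```python
# Encode an arbitrary sentence and round-trip it through the tokenizer.
encoding = tokenizer("the cat sat on the mat .")
print(encoding["input_ids"])                    # token IDs, including special tokens
print(tokenizer.decode(encoding["input_ids"]))  # decoded back to text
```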
### Performance

The provided model is the best-performing of 10 models evaluated on the [Zorro](https://github.com/phueb/Zorro) test suite.
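For reference, below is a minimal sketch of masked-token prediction with the model. This is not the Zorro evaluation itself; it assumes the `phueb/BabyBERTa` checkpoint can be loaded with `RobertaForMaskedLM` (i.e., that it includes masked-language-modeling weights), that the tokenizer was loaded as shown above, and the example sentence is made up:

```python
import torch
from transformers import RobertaForMaskedLM

# Assumption: the checkpoint ships masked-language-modeling weights
# and a mask token is configured in the tokenizer.
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa")
model.eval()

text = f"the child played with the {tokenizer.mask_token} ."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and pick the highest-scoring token.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```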