add info about loading tokenizer
README.md
BabyBERTa is a lightweight version of RoBERTa trained on 5M words of American-English child-directed input.
It is intended for language acquisition research and runs on a single desktop with a single GPU; no high-performance computing infrastructure is needed.

## Loading the tokenizer

BabyBERTa was trained with `add_prefix_space=False`, so it will not work properly with the tokenizer defaults.
Make sure to load the tokenizer as follows:

```python
from transformers import RobertaTokenizerFast

# add_prefix_space must match the setting used during training
tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
                                                 add_prefix_space=False)
```
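As a quick sanity check that the tokenizer loaded as intended, you can encode a sentence and decode it back (a minimal sketch; the example sentence is arbitrary):

```python
# Encode an arbitrary sentence and round-trip it through the tokenizer.
encoding = tokenizer("the cat sat on the mat .")
print(encoding["input_ids"])                    # token IDs, including special tokens
print(tokenizer.decode(encoding["input_ids"]))  # decoded back to text
```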
### Performance

The provided model is the best-performing of 10 models evaluated on the [Zorro](https://github.com/phueb/Zorro) test suite.
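For reference, below is a minimal sketch of masked-token prediction with the model. This is not the Zorro evaluation itself; it assumes the `phueb/BabyBERTa` checkpoint can be loaded with `RobertaForMaskedLM` (i.e., that it includes masked-language-modeling weights), that the tokenizer was loaded as shown above, and the example sentence is made up:

```python
import torch
from transformers import RobertaForMaskedLM

# Assumption: the checkpoint ships masked-language-modeling weights
# and a mask token is configured in the tokenizer.
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa")
model.eval()

text = f"the child played with the {tokenizer.mask_token} ."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and pick the highest-scoring token.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```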