phueb committed
Commit 096814d
1 Parent(s): d64ce14

add info about loading tokenizer

Files changed (1):
  1. README.md +10 -0
README.md CHANGED
@@ -5,6 +5,16 @@
  BabyBERTa is a light-weight version of RoBERTa trained on 5M words of American-English child-directed input.
  It is intended for language acquisition research, on a single desktop with a single GPU - no high-performance computing infrastructure needed.

+ ## Loading the tokenizer
+
+ BabyBERTa was trained with `add_prefix_space=False`, so it will not work properly with the tokenizer defaults.
+ Make sure to load the tokenizer as follows:
+
+ ```python
+ tokenizer = RobertaTokenizerFast.from_pretrained("phueb/BabyBERTa",
+                                                  add_prefix_space=False)
+ ```
+
  ### Performance

  The provided model is the best-performing out of 10 that were evaluated on the [Zorro](https://github.com/phueb/Zorro) test suite.