murawaki committed on
Commit
46ef41f
1 Parent(s): c963228

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -41,7 +41,7 @@ You can also use this model to get the features of a given text.
 
 A character-level vocabulary of size 6K is used. To be precise, rare characters may be split into bytes because byte-level byte-pair encoding (BPE) is used. The BPE tokenizer was trained on a small subset of the training data. Since the data were converted into a one-character-per-line format, merge operations never go beyond character boundaries.
 
-Note that the tokenizer maps U+0020 to `[UNK]` because preprocessing eliminated whitespace characters (U+0020) from training data.
+Note that the tokenizer maps U+0020 to `[UNK]` because preprocessing eliminated whitespace characters (U+0020) from training data. Use U+3000 (Ideographic Space) instead.
 
 ## Training data
 
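
The added sentence asks users to substitute U+3000 for U+0020 before tokenizing. A minimal sketch of that substitution, assuming it is applied to the raw input string; the helper name `replace_ascii_spaces` is hypothetical and not part of the model's or tokenizer's API:

```python
# Hypothetical preprocessing helper (not part of the model's API).
# The README states that U+0020 maps to [UNK] and recommends
# U+3000 (Ideographic Space) instead.
IDEOGRAPHIC_SPACE = "\u3000"


def replace_ascii_spaces(text: str) -> str:
    """Replace every ASCII space (U+0020) with U+3000."""
    return text.replace("\u0020", IDEOGRAPHIC_SPACE)


if __name__ == "__main__":
    sample = "GPT-2 は言語モデルです"
    converted = replace_ascii_spaces(sample)
    # Show the code points so the substitution is visible.
    print([hex(ord(ch)) for ch in converted if ch.isspace()])
```

The converted string would then be passed to the tokenizer in place of the original text. Only U+0020 is handled here; how the tokenizer treats other whitespace such as tabs or newlines is not stated in the README.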