BEE-spoke-data/BeeTokenizer

pszemraj committed on Jul 20, 2024

Commit 44306b6 (verified) · 1 parent: ba51bc8

Update README.md

Files changed (1)

README.md +2 -3

@@ -27,9 +27,8 @@ print(f"Tokens:\n\t{output.input_ids}")
 ## Notes
-1. the default tokenizer (on branch `main`) has a vocab size of 32100.
-    - use a model vocab size of 32128 because GPUs like this better
+1. the default tokenizer (on branch `main`) has a vocab size of 32000
+2. based on the `SentencePieceBPETokenizer` class
 <details>
   <summary>How to Tokenize Text and Retrieve Offsets</summary>
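
Note on the removed sub-bullet: it recommended a model vocab size of 32128 for the 32100-token tokenizer. The commit does not say how 32128 was chosen; one plausible reading is the common practice of rounding the embedding size up to a multiple of 64 so matrix dimensions align well on GPUs. The sketch below only illustrates that arithmetic. The helper name `pad_vocab_size` and the default multiple of 64 are assumptions, not part of this repository.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple (hypothetical helper)."""
    if vocab_size <= 0 or multiple <= 0:
        raise ValueError("vocab_size and multiple must be positive")
    # Ceiling division via negated floor division, then scale back up.
    return -(-vocab_size // multiple) * multiple


# Old tokenizer size from the removed line; needs padding.
old_padded = pad_vocab_size(32100)
# New tokenizer size from the added line; already aligned.
new_padded = pad_vocab_size(32000)

print(old_padded, new_padded)
```

Under this assumption, 32000 is already a multiple of 64, which would explain why the padding advice was dropped along with the size change.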