NeelNanda's picture
Update README.md
55af551
|
raw
history blame
1.06 kB

This is a fork of the GPT NeoX 20B tokenizer, edited to split every numerical digit into a separate token. This has the goal of making it easier for the model to learn arithmetic capabilities and to hopefully be more interpretable, and copies the idea from the PaLM tokenizer.

This was done, extremely hackily, by just removing every token that contained "\d\d" (eg "2013"). All remaining digit containing tokens are "0" ... "9" and " 0" ... " 9"

This comes at the cost of making modelling normal text harder, since eg dates like 2013 which naturally should be a single token are now 2|0|1|3.

This has a reduced vocab size of 48252 (several of the tokens towards the end are special whitespace tokens copied in from GPT-NeoX to make tokenizing code easier - some of these are duplicated in the vocabulary and thus may not actually show up at train time).

It includes a padding token (<|PAD|>) an End-Of-String token (<|EOS|>) and a Beginning-Of-String token (<|BOS|>)