---
license: mit
---
TokenMonster
The documentation and code are available on GitHub at alasdairforsythe/tokenmonster.
The pretrained vocabularies are all available for download here.
July 11: TokenMonster v1.1.1 has been released. The 420 prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.
Choose a dataset from:
code
english
englishcode
fiction
Choose a vocab size from:
1024
2048
4096
8000
16000
24000
32000
40000
50256
65536
100256
Choose an optimization mode from:
unfiltered
clean
balanced
consistent
strict
For a capcode-disabled vocabulary, add:
nocapcode
And finally add the version number:
v1
Examples:
fiction-24000-consistent-v1
code-4096-clean-nocapcode-v1
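The naming scheme above can be sketched as a small helper that composes a vocabulary name from its components. The component lists come from this page; the function itself (`vocab_name` and its parameters) is an illustrative sketch, not part of the TokenMonster API.

```python
# Valid components, as listed on this page.
DATASETS = {"code", "english", "englishcode", "fiction"}
SIZES = {1024, 2048, 4096, 8000, 16000, 24000, 32000,
         40000, 50256, 65536, 100256}
MODES = {"unfiltered", "clean", "balanced", "consistent", "strict"}

def vocab_name(dataset: str, size: int, mode: str,
               capcode: bool = True, version: str = "v1") -> str:
    """Compose a vocabulary name like 'fiction-24000-consistent-v1'."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if size not in SIZES:
        raise ValueError(f"unsupported vocab size: {size}")
    if mode not in MODES:
        raise ValueError(f"unknown optimization mode: {mode}")
    parts = [dataset, str(size), mode]
    if not capcode:
        parts.append("nocapcode")  # capcode-disabled variant
    parts.append(version)
    return "-".join(parts)

print(vocab_name("fiction", 24000, "consistent"))
# fiction-24000-consistent-v1
print(vocab_name("code", 4096, "clean", capcode=False))
# code-4096-clean-nocapcode-v1
```

Both printed names match the two examples above.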
There are two additional vocabularies:
gpt2
llama