---
license: mit
---
## TokenMonster
The documentation and code are available on GitHub: [alasdairforsythe/tokenmonster](https://github.com/alasdairforsythe/tokenmonster).
The pretrained vocabularies are all available for download [here](https://huggingface.co/alasdairforsythe/tokenmonster/tree/main/vocabs).
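Individual vocabulary files can also be fetched in a script with the `huggingface_hub` client. This is a minimal sketch: the filename below is an assumption based on the naming scheme described next and the `vocabs` folder linked above, so check the folder listing for the files that have actually been released.

```python
# Sketch: fetch one pretrained vocabulary file from the Hugging Face Hub.
# The path/filename is an assumption; verify it against the vocabs folder.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="alasdairforsythe/tokenmonster",
    filename="vocabs/english-32000-balanced-v1.vocab",  # assumed name
)
print(path)  # local cache path of the downloaded .vocab file
```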
**July 11:** TokenMonster v1.1.1 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.
Choose a dataset from:
- `code`
- `english`
- `englishcode`
- `fiction`
Choose a vocab size from:
- `1024`
- `2048`
- `4096`
- `8000`
- `16000`
- `24000`
- `32000`
- `40000`
- `50256`
- `65536`
- `100256`
Choose an optimization mode from:
- `unfiltered`
- `clean`
- `balanced`
- `consistent`
- `strict`
For a capcode-disabled vocabulary, add:
- `nocapcode`
And finally add the version number:
- `v1`
Examples:
- `fiction-24000-consistent-v1`
- `code-4096-clean-nocapcode-v1`
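Combined, the parts form a name of the pattern `dataset-vocabsize-mode[-nocapcode]-v1`, which is the string passed to the loader. As a minimal sketch using the `tokenmonster` Python library (`pip install tokenmonster`) documented in the GitHub repository linked above:

```python
# Minimal sketch using the tokenmonster Python library (pip install tokenmonster).
# load() fetches the named prebuilt vocabulary if it is not already cached locally.
import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")

tokens = vocab.tokenize("if x == 1: print('hello world')")
print(len(tokens), "tokens")

# decode() reverses tokenize(), restoring the original text.
print(vocab.decode(tokens))
```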
There are two additional vocabularies:
- `gpt2`
- `llama`
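These mirror the GPT-2 and LLaMA tokenizers in TokenMonster format, which makes side-by-side comparisons straightforward. A hedged sketch follows; it assumes both vocabularies load by the bare names shown here, so check the vocabs folder for the exact released names.

```python
# Sketch: compare token counts between a TokenMonster vocabulary and the
# gpt2 reference vocabulary. The names passed to load() are assumptions.
import tokenmonster

text = "The quick brown fox jumps over the lazy dog."
for name in ("english-32000-balanced-v1", "gpt2"):
    vocab = tokenmonster.load(name)
    print(name, "->", len(vocab.tokenize(text)), "tokens")
```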