---
license: mit
---

## TokenMonster

The documentation and code are available on GitHub: [alasdairforsythe/tokenmonster](https://github.com/alasdairforsythe/tokenmonster).

The pretrained vocabularies are all available for download [here](https://huggingface.co/alasdairforsythe/tokenmonster/tree/main/vocabs).

**July 11:** TokenMonster v1.1.1 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.

Choose a dataset from:

- `code`
- `english`
- `englishcode`
- `fiction`

Choose a vocab size from:

- `1024`
- `2048`
- `4096`
- `8000`
- `16000`
- `24000`
- `32000`
- `40000`
- `50256`
- `65536`
- `100256`

Choose an optimization mode from:

- `unfiltered`
- `clean`
- `balanced`
- `consistent`
- `strict`

For a capcode-disabled vocabulary, add:

- `nocapcode`

And finally, add the version number:

- `v1`

Examples:

- `fiction-24000-consistent-v1`
- `code-4096-clean-nocapcode-v1`
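
The naming scheme above (dataset, vocab size, optimization mode, optional `nocapcode`, version) can be sketched as a small helper. This is illustrative only; `vocab_name` is a hypothetical function, not part of the TokenMonster API:

```python
# Components documented above for the "420" prebuilt vocabularies.
DATASETS = {"code", "english", "englishcode", "fiction"}
SIZES = {1024, 2048, 4096, 8000, 16000, 24000, 32000, 40000, 50256, 65536, 100256}
MODES = {"unfiltered", "clean", "balanced", "consistent", "strict"}

def vocab_name(dataset: str, size: int, mode: str,
               capcode: bool = True, version: str = "v1") -> str:
    """Assemble a prebuilt-vocabulary name such as 'fiction-24000-consistent-v1'."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if size not in SIZES:
        raise ValueError(f"unsupported vocab size: {size}")
    if mode not in MODES:
        raise ValueError(f"unknown optimization mode: {mode}")
    parts = [dataset, str(size), mode]
    if not capcode:
        parts.append("nocapcode")  # capcode-disabled variant
    parts.append(version)
    return "-".join(parts)

print(vocab_name("fiction", 24000, "consistent"))        # fiction-24000-consistent-v1
print(vocab_name("code", 4096, "clean", capcode=False))  # code-4096-clean-nocapcode-v1
```

The resulting string matches the filenames listed in the vocabulary download directory linked above.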

There are two additional vocabularies:

- `gpt2`
- `llama`