tokenmonster / README.md
alasdairforsythe's picture
Update README.md
c47476f
|
raw
history blame
1.01 kB
metadata
license: mit

TokenMonster

The documentation and code is available on Github alasdairforsythe/tokenmonster.

The pretrained vocabularies are all available for download here.

July 11: TokenMonster v1.1.1 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.

Choose a dataset from:

  • code
  • english
  • englishcode
  • fiction

Choose a vocab size from:

  • 1024
  • 2048
  • 4096
  • 8000
  • 16000
  • 24000
  • 32000
  • 40000
  • 50256
  • 65536
  • 100256

Choose an optimization mode from:

  • unfiltered
  • clean
  • balanced
  • consistent
  • strict

For a capcode disabled vocabulary add:

  • nocapcode

And finally add the version number:

  • v1

Examples:

  • fiction-24000-consistent-v1
  • code-4096-clean-nocapcode-v1

There are two additional vocabularies:

  • gpt2
  • llama