---
license: mit
---

## TokenMonster

The documentation and code are available on GitHub: [alasdairforsythe/tokenmonster](https://github.com/alasdairforsythe/tokenmonster).

The pretrained vocabularies are all available for download [here](https://huggingface.co/alasdairforsythe/tokenmonster/tree/main/vocabs).

**July 11:** TokenMonster v1.1.1 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.

Choose a dataset from:

- `code`
- `english`
- `englishcode`
- `fiction`

Choose a vocab size from:

- `1024`
- `2048`
- `4096`
- `8000`
- `16000`
- `24000`
- `32000`
- `40000`
- `50256`
- `65536`
- `100256`

Choose an optimization mode from:

- `unfiltered`
- `clean`
- `balanced`
- `consistent`
- `strict`

For a capcode-disabled vocabulary, add:

- `nocapcode`

And finally, add the version number:

- `v1`

Examples:

- `fiction-24000-consistent-v1`
- `code-4096-clean-nocapcode-v1`
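
The naming scheme above (dataset, vocab size, optimization mode, optional `nocapcode`, version) can be sketched as a small helper. This is illustrative only; `vocab_name` is a hypothetical function, not part of the TokenMonster API:

```python
# Components documented above for the "420" prebuilt vocabularies.
DATASETS = {"code", "english", "englishcode", "fiction"}
SIZES = {1024, 2048, 4096, 8000, 16000, 24000, 32000, 40000, 50256, 65536, 100256}
MODES = {"unfiltered", "clean", "balanced", "consistent", "strict"}

def vocab_name(dataset: str, size: int, mode: str,
               capcode: bool = True, version: str = "v1") -> str:
    """Assemble a prebuilt-vocabulary name such as 'fiction-24000-consistent-v1'."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if size not in SIZES:
        raise ValueError(f"unsupported vocab size: {size}")
    if mode not in MODES:
        raise ValueError(f"unknown optimization mode: {mode}")
    parts = [dataset, str(size), mode]
    if not capcode:
        parts.append("nocapcode")  # capcode-disabled variant
    parts.append(version)
    return "-".join(parts)

print(vocab_name("fiction", 24000, "consistent"))        # fiction-24000-consistent-v1
print(vocab_name("code", 4096, "clean", capcode=False))  # code-4096-clean-nocapcode-v1
```

The resulting string matches the filenames listed in the vocabulary download directory linked above.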

There are two additional vocabularies:

- `gpt2`
- `llama`