In this repo, "dataset" (or "datasets") refers to the first 16384 rows of the silicone:dyda_da:train dataset.

Trained from the gpt2 tokenizer, this tokenizer matches the base's average number of tokens per datapoint while using a vocab_size of only 8192 (down from the base's 50257).
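The training script is not part of this card, so the following is only a sketch of one plausible way to get a smaller tokenizer from the gpt2 base, using `train_new_from_iterator` from transformers. The helper names `batch_iterator` and `train_small_tokenizer`, the batch size, and the idea of passing the dataset's utterances as `texts` are assumptions for illustration.

```python
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size texts (hypothetical helper)."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def train_small_tokenizer(texts: Iterable[str], vocab_size: int = 8192):
    """Retrain the gpt2 tokenization pipeline on texts with a smaller vocabulary.

    Assumption: this mirrors how a tokenizer like this one could be built;
    the card does not confirm the exact procedure.
    """
    # Imported lazily so batch_iterator is usable without transformers installed.
    import transformers

    base = transformers.GPT2TokenizerFast.from_pretrained("gpt2")
    return base.train_new_from_iterator(batch_iterator(texts), vocab_size=vocab_size)
```

Calling `train_small_tokenizer(texts)` downloads the gpt2 tokenizer, so it needs network access; `texts` would be the text column of the 16384 rows described above.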

import transformers
tokenizer = transformers.GPT2TokenizerFast.from_pretrained("umarzein/silicone-dyda-16k-8k-tokenizer")
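To check the average tokens per datapoint for any tokenizer, a small helper along these lines could be used. This is a minimal sketch: the function name `avg_tokens_per_datapoint` is hypothetical, and the example uses whitespace splitting as a stand-in encoder so it runs without downloading anything.

```python
from typing import Callable, Iterable, Sequence


def avg_tokens_per_datapoint(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Return the mean number of tokens that encode produces per text."""
    total_tokens = 0
    count = 0
    for text in texts:
        total_tokens += len(encode(text))
        count += 1
    if count == 0:
        raise ValueError("texts must contain at least one datapoint")
    return total_tokens / count


# Stand-in encoder (whitespace split), only to show the call shape.
sample = ["hello there", "how are you today"]
print(avg_tokens_per_datapoint(str.split, sample))  # 3.0
```

With a loaded tokenizer, pass `tokenizer.encode` as `encode` and the dataset's texts as `texts`; running it once with this tokenizer and once with the gpt2 base gives the two averages to compare.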

