AIGym Custom Tokenizer (CL200K)
Overview
The AIGym CL200K Tokenizer is a custom tokenizer designed for pretraining large language models. It is derived from the Meta-Llama-3-8B tokenizer and trained on the AIGym Pretraining Corpus.
Features
- Built on the Meta-Llama-3-8B tokenizer
- Supports a vocabulary size of 200K tokens
- Optimized for educational, programming, multilingual, and mathematical texts
- Includes a custom PAD token (see the sketch after this list)
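As a quick way to confirm the vocabulary size and the custom PAD token, the standard `transformers` attributes can be inspected after loading (a minimal sketch; the exact PAD token string depends on the tokenizer config):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIGym/cl200k")

# Vocabulary size, expected to be around 200K entries
print(len(tokenizer))

# The custom PAD token and its id (exact string is defined in the tokenizer config)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```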
Usage
Loading the Tokenizer
To load the tokenizer from the Hugging Face Hub and encode text:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("AIGym/cl200k")

# Encode a sample string into token ids
text = "Hello, world!"
tokens = tokenizer.encode(text)
print(tokens)
```
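To recover text from token ids, or to batch-encode with the custom PAD token, the usual `transformers` calls apply (a minimal sketch continuing from the snippet above):

```python
# Round-trip: decode the ids from the snippet above back to text
print(tokenizer.decode(tokens))

# Batch-encode strings of different lengths; padding uses the custom PAD token
batch = tokenizer(["Hello, world!", "def add(a, b):\n    return a + b"], padding=True)
print(batch["input_ids"])
```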