AIGym Custom Tokenizer (CL200K)

Overview

The AIGym CL200K Tokenizer is a custom tokenizer designed for pretraining large language models. It is based on the Meta-Llama-3-8B tokenizer and was trained on the AIGym Pretraining Corpus.

Features

Built on Meta-Llama-3-8B
Supports a vocabulary size of 200K tokens
Optimized for educational, programming, multilingual, and mathematical texts
Includes a custom PAD token
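
The vocabulary size and PAD token listed above can be checked directly on a loaded tokenizer. The following is a minimal sketch: the helper names `describe_tokenizer` and `load_and_describe` are hypothetical and not part of this project, and the example makes no claim about which PAD string or ID the tokenizer actually uses.

```python
def describe_tokenizer(tokenizer):
    """Summarize the properties mentioned in the feature list."""
    return {
        # len() counts the base vocabulary plus any added special tokens.
        "vocab_size": len(tokenizer),
        # The PAD token string and its integer ID, as reported by the tokenizer.
        "pad_token": tokenizer.pad_token,
        "pad_token_id": tokenizer.pad_token_id,
    }


def load_and_describe(repo_id="AIGym/cl200k"):
    """Load the tokenizer by repository ID and summarize it."""
    # Imported lazily so describe_tokenizer works without transformers installed.
    from transformers import AutoTokenizer

    return describe_tokenizer(AutoTokenizer.from_pretrained(repo_id))
```

Calling `load_and_describe()` should report a vocabulary size of roughly 200K and a PAD token that is not `None`, if the tokenizer matches the features described here.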

Usage

Loading the Tokenizer

Load the tokenizer with AutoTokenizer and encode a string:

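from transformers import AutoTokenizer

# Load the tokenizer files from the AIGym/cl200k repository.
tokenizer = AutoTokenizer.from_pretrained("AIGym/cl200k")

# Encode a string into a list of token IDs.
text = "Hello, world!"
tokens = tokenizer.encode(text)
print(tokens)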
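
Beyond encoding a single string, two common follow-up steps are batching inputs of different lengths and decoding IDs back to text. The sketch below assumes the `tokenizer` object loaded above; `encode_batch` and `round_trip` are hypothetical helper names introduced only for illustration, and batching with `padding=True` relies on the custom PAD token being set.

```python
def encode_batch(tokenizer, texts):
    """Tokenize several strings at once, padding to the longest one."""
    # padding=True requires the tokenizer to have a PAD token defined.
    batch = tokenizer(texts, padding=True)
    # attention_mask marks real tokens with 1 and padding positions with 0.
    return batch["input_ids"], batch["attention_mask"]


def round_trip(tokenizer, text):
    """Encode then decode a string, leaving out special tokens such as BOS."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids)
```

For example, `encode_batch(tokenizer, ["Hello, world!", "Hi"])` returns two equal-length ID lists plus their attention masks, and `round_trip(tokenizer, "Hello, world!")` should typically return the original string.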