ArlowGPT Tokenizer

This repository contains a custom-trained BPE tokenizer for ArlowGPT, created by Yuchen Xie.

Tokenizer Details

  • Type: BPE (Byte-Pair Encoding)
  • Vocabulary Size: 131,072 tokens
  • Special Tokens (a usage sketch follows this list):
    • Start of Text: <|startoftext|>
    • End of Text: <|endoftext|>
    • Padding: <|pad|>
    • Unknown: <|unk|>
    • Mask: <|mask|>
    • Message Start: <|im_start|>
    • Message End: <|im_end|>
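
The <|im_start|> and <|im_end|> delimiters suggest a ChatML-style message layout, but this card does not document the exact prompt template. The snippet below is therefore only an illustrative sketch of how the special tokens could be combined, not a confirmed format.

# Illustrative sketch only: assumes a ChatML-style layout for
# <|im_start|> / <|im_end|>, which this card does not confirm.
def format_chat_turn(role: str, content: str) -> str:
    """Wrap a single chat message in the message-delimiter tokens."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

prompt = (
    "<|startoftext|>"
    + format_chat_turn("user", "What does a BPE tokenizer do?")
    + "<|im_start|>assistant\n"
)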

Usage

from transformers import AutoTokenizer

# Load the tokenizer directly from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
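
Once loaded, it behaves like any other Transformers fast tokenizer. The short round-trip check below uses an arbitrary example sentence:

# Encode a sample string into token IDs and decode it back.
encoding = tokenizer("ArlowGPT uses a 131,072-token BPE vocabulary.")
print(encoding["input_ids"])                                    # token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))   # byte-level BPE pieces
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))  # reconstructs the text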

Training Details

This tokenizer was trained on 10B randomly shuffled GPT-2 tokens using a custom training script written by Yuchen Xie. It is compatible with the Hugging Face Transformers AutoTokenizer class.
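
The training script itself is not published here. As a rough illustration only, a byte-level BPE tokenizer with this vocabulary size and these special tokens could be trained with the Hugging Face tokenizers library along the lines below; the corpus path is a placeholder and the actual script may differ in its details.

# Minimal sketch (not the author's actual script): training a byte-level BPE
# tokenizer with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

special_tokens = [
    "<|startoftext|>", "<|endoftext|>", "<|pad|>",
    "<|unk|>", "<|mask|>", "<|im_start|>", "<|im_end|>",
]

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=131_072, special_tokens=special_tokens)
tokenizer.train(files=["shuffled_corpus.txt"], trainer=trainer)  # placeholder corpus path

# Wrap the trained tokenizer so AutoTokenizer.from_pretrained can load the saved directory.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|pad|>",
    unk_token="<|unk|>",
    mask_token="<|mask|>",
)
fast_tokenizer.save_pretrained("arlowgpt-tokenizer")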
