ARCHON Tokenizer v2

BPE tokenizer used by ARCHON ASI. Custom 32K vocabulary + 6 ChatML special tokens.

Vocab info

  • Base vocab: 32,000 BPE tokens (custom ARCHON corpus)
  • Added: 6 ChatML/tool-calling specials
  • Total: 32,006
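
The total above is 32,000 + 6. Here is a minimal sketch of a sanity check, assuming that len(tok) on the loaded tokenizer counts the added special tokens (this holds for Hugging Face tokenizers in general but is not confirmed for this repository). The helper name check_vocab_size is hypothetical and not part of the tokenizer.

```python
def check_vocab_size(tok, base: int = 32_000, added: int = 6) -> bool:
    """Return True if the tokenizer's full size matches base + added.

    `tok` is anything that supports len(); for a Hugging Face tokenizer,
    len(tok) is expected to include added special tokens (assumption).
    """
    return len(tok) == base + added
```

With the tokenizer loaded as in the Usage section, check_vocab_size(tok) would be expected to return True if the counts listed above are accurate.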

Special tokens

Token            ID   Use
<pad>            0    padding
<bos>            1    beginning of sequence
<eos>            2    end of sequence
<|im_start|>
<|im_end|>
<|system|>
<|user|>
<|assistant|>
<|tool_call|>
<|tool_result|>
<|task_type|>

Chat template (ChatML)

The chat template is available via tokenizer.apply_chat_template(messages), which renders a list of messages in ChatML format.
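
To show what ChatML output generally looks like, here is a rough pure-Python sketch of the conventional layout. It assumes the standard <|im_start|>role ... <|im_end|> framing. The template actually shipped with this tokenizer may differ, for example by using its dedicated role tokens. The function render_chatml is a hypothetical stand-in, not the tokenizer's API.

```python
from typing import Dict, List

# Assumed spellings of the ChatML delimiters (standard ChatML convention).
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False) -> str:
    """Render messages in the conventional ChatML layout (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn: start marker, role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)
```

For real use, prefer tokenizer.apply_chat_template, which applies whatever template the repository defines.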

Usage

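from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("jescy525/archon-tokenizer-v2")

# Render the conversation as a ChatML string, ending with the assistant prompt.
text = tok.apply_chat_template(
    [{"role": "user", "content": "Hello ARCHON"}],
    tokenize=False, add_generation_prompt=True,
)

# encode() adds <bos> and <eos> by default; pass add_special_tokens=False to skip them.
ids = tok.encode(text)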

Roundtrip safety

Encoding adds <bos> and <eos> by default, so decoding the resulting IDs does not reproduce the input text exactly. Pass add_special_tokens=False to encode() to skip them.
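
If the IDs were already produced with the default settings, the outer markers can be removed afterwards. This is a minimal sketch, assuming the IDs from the special-token table (<bos> = 1, <eos> = 2). The helper strip_outer_specials is hypothetical and operates on a plain list of integers.

```python
from typing import List


def strip_outer_specials(ids: List[int], bos_id: int = 1, eos_id: int = 2) -> List[int]:
    """Drop one leading <bos> and one trailing <eos>, if present."""
    start = 1 if ids and ids[0] == bos_id else 0
    # Only strip a trailing <eos> if it is not the same element as the <bos>.
    if len(ids) > start and ids[-1] == eos_id:
        end = len(ids) - 1
    else:
        end = len(ids)
    return ids[start:end]
```

When encoding fresh text, calling tok.encode(text, add_special_tokens=False) avoids the need for this step.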
