ARCHON Tokenizer v2

BPE tokenizer used by ARCHON ASI. Custom 32K vocabulary + 6 ChatML special tokens.

Vocab info

Base vocab: 32,000 BPE tokens (custom ARCHON corpus)
Added: 6 ChatML/tool-calling specials
Total: 32,006

Special tokens

Token	ID	Use
`<pad>`	0	padding
`<bos>`	1	begin of sequence
`<eos>`	2	end of sequence
`<|im_start|>`
`<|im_end|>`
`<|system|>`
`<|user|>`
`<|assistant|>`
`<|tool_call|>`
`<|tool_result|>`
`<|task_type|>`
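The numeric IDs can also be read from a loaded tokenizer rather than from the table. The following is a minimal sketch, assuming a Hugging Face tokenizer object that exposes convert_tokens_to_ids. The helper name special_token_ids is hypothetical and is not part of the ARCHON release, and the exact token spellings passed to it should be checked against the tokenizer's own vocabulary.

```python
from typing import Dict, Iterable, Optional


def special_token_ids(tok, tokens: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each token string to its ID, or None if it is not in the vocabulary.

    Hypothetical helper: `tok` is assumed to be a Hugging Face tokenizer.
    """
    unk_token = getattr(tok, "unk_token", None)
    unk_id = getattr(tok, "unk_token_id", None)
    ids: Dict[str, Optional[int]] = {}
    for token in tokens:
        token_id = tok.convert_tokens_to_ids(token)
        # Unknown strings come back as the unk ID (or None), so report them as missing.
        missing = token_id is None or (token_id == unk_id and token != unk_token)
        ids[token] = None if missing else token_id
    return ids
```

If the table above is accurate, calling special_token_ids(tok, ["<pad>", "<bos>", "<eos>"]) on the loaded tokenizer should map those three tokens to 0, 1, and 2. A None value means the string is not a single token in the vocabulary.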

Chat template (ChatML)

Available via tokenizer.apply_chat_template(messages), which renders a list of role/content messages in ChatML format.
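For orientation, the generic ChatML layout can be sketched in plain Python. This is an illustration only: the function render_chatml is hypothetical, it follows the commonly used ChatML convention, and the template bundled with this tokenizer is not shown here and may differ (for example in how it uses the role and tool tokens).

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages in the generic ChatML layout (illustrative sketch)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hello ARCHON"}], add_generation_prompt=True))
```

To see what the real template produces, compare this against the output of tok.apply_chat_template(..., tokenize=False) for the same messages.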

Usage

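from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("jescy525/archon-tokenizer-v2")

# Render the conversation as a ChatML string, ending with an open assistant turn.
text = tok.apply_chat_template(
    [{"role": "user", "content": "Hello ARCHON"}],
    tokenize=False, add_generation_prompt=True,
)

# Convert the rendered string to token IDs (adds <bos> and <eos> by default).
ids = tok.encode(text)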

Roundtrip safety

Encoding adds <bos> and <eos> by default, so the encoded IDs contain two tokens that are not in the input text. Pass add_special_tokens=False to encode() to skip them.
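The difference between the two modes can be checked with a small helper. This is a sketch under stated assumptions: compare_encodings is a hypothetical name, and tok is assumed to be a Hugging Face tokenizer whose encode and decode accept the arguments shown. Whether decoding reproduces arbitrary text exactly also depends on the tokenizer's normalization, which is not described here.

```python
from typing import Any, Dict


def compare_encodings(tok, text: str) -> Dict[str, Any]:
    """Encode text with and without special tokens and test the roundtrip."""
    with_specials = tok.encode(text)  # default: special tokens are added
    without_specials = tok.encode(text, add_special_tokens=False)
    # Decode the plain IDs without any cleanup so the comparison is exact.
    decoded = tok.decode(
        without_specials,
        skip_special_tokens=False,
        clean_up_tokenization_spaces=False,
    )
    return {
        "with_specials": with_specials,
        "without_specials": without_specials,
        "roundtrip_ok": decoded == text,
    }
```

With the defaults described above, with_specials is expected to be the same sequence as without_specials with the <bos> ID in front and the <eos> ID at the end.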
