---
library_name: transformers
tags:
- tokenizer
- mlm
license: mit
---

# claude tokenizer: mlm

A variant of [Xenova/claude-tokenizer](https://huggingface.co/Xenova/claude-tokenizer) with some small changes to support usage as an MLM (masked language modeling) tokenizer.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('pszemraj/claude-tokenizer-mlm')

text = "Hello, this is a test input."
ids = tokenizer(text)
print(tokenizer.decode(ids['input_ids'], skip_special_tokens=False))
# Hello, this is a test input.

len(tokenizer)
# 65004
```

details relevant for model configs using this tokenizer:

```py
>>> tokenizer
GPT2TokenizerFast(name_or_path='pszemraj/claude-tokenizer-mlm', vocab_size=65000, model_max_length=200000, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '', 'eos_token': '', 'unk_token': '', 'sep_token': '', 'pad_token': '', 'cls_token': '', 'mask_token': ''}, clean_up_tokenization_spaces=True), added_tokens_decoder={
	0: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65000: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65001: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65002: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	65003: AddedToken("", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}
```

one of the added tokens is included but unused, and the CLS and BOS tokens are set to the same token - see `tokenizer_config.json` for details
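
since the mask token is what this variant adds over the base tokenizer, here is a rough sketch of how it gets used in MLM-style masking. this is a simplified illustration, not this repo's training code: the 15% rate and the replace-with-mask-only strategy are assumptions (the standard BERT recipe also sometimes keeps or randomizes the selected tokens), and the ids in the usage line are hypothetical, with 65003 taken from the `added_tokens_decoder` dump above.

```python
import random


def mask_tokens(input_ids, mask_token_id, special_ids, mlm_probability=0.15, seed=0):
    """Replace roughly `mlm_probability` of the non-special tokens with the mask id.

    Returns (masked_ids, labels): labels are -100 (ignored by the loss)
    everywhere except masked positions, which hold the original token id.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < mlm_probability:
            masked.append(mask_token_id)  # hide the token from the model
            labels.append(tok)            # the model must predict the original
        else:
            masked.append(tok)
            labels.append(-100)           # not a prediction target
    return masked, labels


# hypothetical example: 0 and 2 stand in for special-token ids,
# 65003 for the mask token id from the dump above
masked, labels = mask_tokens([0, 10, 11, 12, 13, 2], mask_token_id=65003, special_ids={0, 2})
```

in practice, `DataCollatorForLanguageModeling` from `transformers` implements the full masking recipe for you given a tokenizer with a mask token set.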