What are the <0x00> to <0xFF> tokens in the tokenizer.json?

#16
by jiang719 - opened

What are these tokens in the tokenizer?

If I run this line

tokenizer.convert_ids_to_tokens(tokenizer.encode('int add(int a, int b) {\n    return a + b;\n}'))

It gives me

['<s>', '▁int', '▁add', '(', 'int', '▁a', ',', '▁int', '▁b', ')', '▁{', '<0x0A>', '▁▁▁', '▁return', '▁a', '▁+', '▁b', ';', '<0x0A>', '}']

It looks like <0x0A> is used as the newline. Initially, I thought these tokens were special tokens for hexadecimal values.

Is this the expected behavior? What do the other tokens mean?

Code Llama org

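I think it is byte fallback: a character that has no token of its own in the vocabulary (in this case the newline) is encoded as its UTF-8 bytes, one <0xNN> token per byte. The 256 tokens <0x00> to <0xFF> cover every possible byte value, so any input can be tokenized without producing an unknown token.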
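
To illustrate, here is a minimal pure-Python sketch of how byte tokens map to and from text. It is not the tokenizer's actual implementation; the helper names `byte_fallback` and `decode_pieces` are made up for this example, and special tokens such as `<s>` are assumed to have been removed beforehand.

```python
import re

# Matches byte-fallback pieces such as "<0x0A>" and captures the two hex digits.
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback(char):
    """Hypothetical helper: one <0xNN> piece per UTF-8 byte of `char`."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


def decode_pieces(pieces):
    """Hypothetical helper: join pieces into text, resolving byte pieces.

    Assumes special tokens like <s> were already dropped.
    """
    out = bytearray()
    for piece in pieces:
        match = BYTE_TOKEN.fullmatch(piece)
        if match:
            # A byte piece contributes exactly one raw byte.
            out.append(int(match.group(1), 16))
        else:
            # U+2581 is the SentencePiece marker for a space.
            out.extend(piece.replace("\u2581", " ").encode("utf-8"))
    # Consecutive byte pieces may form one multi-byte character.
    return out.decode("utf-8", errors="replace")


pieces = ['▁int', '▁add', '(', 'int', '▁a', ',', '▁int', '▁b', ')', '▁{',
          '<0x0A>', '▁▁▁', '▁return', '▁a', '▁+', '▁b', ';', '<0x0A>', '}']
text = decode_pieces(pieces)
# The leading space comes from the first "▁" marker; a real tokenizer
# typically strips it when decoding.
print(text.lstrip(" "))
print(byte_fallback("\n"))
```

A character outside the vocabulary that needs several UTF-8 bytes would appear as several byte tokens in a row, which is why the decoder collects bytes first and only converts to text at the end.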
