Unexpected Results from Tokenizer

#85
by justshao - opened

When tokenizing with the Llama-3 tokenizer and return_offsets_mapping=True, the resulting offset_mapping does not match the behavior described in the docs.

Example:

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
print(tokenizer(["Sample input"], return_offsets_mapping=True))

will yield:

{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

Per the docs, offset_mapping should contain a (char_start, char_end) tuple giving the character span covered by each token. For "Sample input" one would expect roughly (0, 6) for "Sample" and (6, 12) for " input", but the returned spans are degenerate: "Sample" maps to (0, 0) and " input" to (6, 6).
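
To make the mismatch concrete, here is a minimal diagnostic sketch (reusing the tokenizer from above; the loop itself is illustrative, not part of the original report): slice the input string with each offset pair and compare it against the decoded token. With well-formed offsets the substring should match the decoded token, aside from the zero-width BOS special token.

text = "Sample input"
enc = tokenizer(text, return_offsets_mapping=True)

# Compare each decoded token with the substring its offsets claim to cover.
for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(repr(tokenizer.decode([tok_id])), "->", repr(text[start:end]))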
