Unexpected Results from Tokenizer
#85 · opened by justshao
When tokenizing with the Llama-3 tokenizer with return_offsets_mapping=True, the resulting offset_mapping does not match the behavior described in the docs.
Example:
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
print(tokenizer(["Sample input"], return_offsets_mapping=True))
will yield:
{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}
According to the docs, offset_mapping should contain a (char_start, char_end) tuple for each token, but here the spans are empty: (0, 0), (0, 0), and (6, 6).
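For context, a well-formed offset_mapping should let you recover each token's surface text by slicing the original input string. A minimal sketch of the expected shape (plain Python, no tokenizer; the exact token split and the (0, 0) entry for the BOS special token are assumptions, not output from the library):

```python
# What a correct offset_mapping means: each (char_start, char_end)
# tuple slices the input string back into that token's text.
# Assumed split: BOS special token, "Sample", " input".
text = "Sample input"
expected_offsets = [(0, 0), (0, 6), (6, 12)]  # BOS maps to an empty span
tokens = [text[start:end] for start, end in expected_offsets]
print(tokens)  # -> ['', 'Sample', ' input']
```

By that reading, the second and third tuples in the reported output should be roughly (0, 6) and (6, 12) rather than (0, 0) and (6, 6).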