Tokenizer Issue
#2, opened by Saptarshi7
Hi,
I'm trying to use LUKE on SQuAD. When I process the training examples with the tokenizer, I get this error:
NotImplementedError: return_offset_mapping is not available when using Python tokenizers. To use this feature, change your tokenizer to one deriving from transformers.PreTrainedTokenizerFast.
Could you tell me what the issue might be? I am running the latest version of Hugging Face Transformers.
The problem is that a fast tokenizer is not implemented for LukeTokenizer, so it cannot be used with return_offsets_mapping=True.

As a workaround, you can tokenize the text with roberta-base, which has a fast tokenizer implementation and shares its subword vocabulary with luke-base.
Example:
>>> from transformers import AutoTokenizer
>>> model_name = "roberta-base"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
>>> tokenizer.encode_plus("this is test", return_offsets_mapping=True)
{'input_ids': [0, 9226, 16, 1296, 2], 'attention_mask': [1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (5, 7), (8, 12), (0, 0)]}
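For SQuAD preprocessing specifically, the offset mapping is typically used to convert an answer's character span into token indices. A minimal sketch of that step, assuming the offsets above from the thread's example (the helper name `answer_token_span` is hypothetical, not part of any library):

```python
def answer_token_span(offset_mapping, answer_start, answer_end):
    """Return (first, last) token indices whose character offsets overlap
    the half-open character span [answer_start, answer_end).

    offset_mapping is the list of (start, end) pairs returned by a fast
    tokenizer with return_offsets_mapping=True; special tokens map to
    (0, 0) and are skipped.
    """
    hits = [
        i
        for i, (start, end) in enumerate(offset_mapping)
        if (start, end) != (0, 0) and start < answer_end and end > answer_start
    ]
    return (hits[0], hits[-1]) if hits else None

# Offsets from tokenizer.encode_plus("this is test", return_offsets_mapping=True)
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (0, 0)]
print(answer_token_span(offsets, 8, 12))  # "test" -> (3, 3)
```

Because roberta-base and luke-base tokenize identically, the resulting token indices can be used when building LUKE's training features.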
Yes, thank you. This worked!
Saptarshi7 changed discussion status to closed.