Where to find the token IDs of the tokenizer?
#22 opened by Mohamed123321
Hello,
I was wondering how I can access and change the tokenizer's token IDs?
Thanks!
To clarify, I mean the mapping between tokens (parts of words) and their IDs.
Hey! The tokenizer is based on SentencePiece by default. You can't really change the existing mapping, but you can add new tokens with tokenizer.add_tokens() and inspect the full vocabulary with tokenizer.get_vocab().
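For instance, here is a minimal sketch of inspecting the mapping (the t5-small checkpoint is an assumption; substitute your model's name):

from transformers import AutoTokenizer

# Load the tokenizer (t5-small is a placeholder; use your model's checkpoint)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# get_vocab() returns the full token -> ID mapping as a dict
vocab = tokenizer.get_vocab()
print(len(vocab))  # vocabulary size

# Look up individual mappings in both directions
print(tokenizer.convert_tokens_to_ids("▁time"))  # token -> ID
print(tokenizer.convert_ids_to_tokens(97))       # ID -> token

And here is a round-trip encode/decode example that shows the IDs in practice: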
sentence = "What time is it, Tom?"
sentence_encoded = tokenizer(sentence, return_tensors='pt')
sentence_decoded = tokenizer.decode(
sentence_encoded["input_ids"][0],
skip_special_tokens=True
)
print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)
ENCODED SENTENCE:
tensor([ 363, 97, 19, 34, 6, 3059, 58, 1])
DECODED SENTENCE:
What time is it, Tom?
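If you do need extra entries in the vocabulary, a short sketch of adding one (the token string <my_token> is just a placeholder):

# add_tokens() appends new entries to the vocabulary
# and returns how many were actually added
num_added = tokenizer.add_tokens(["<my_token>"])
print(num_added)  # 1 if the token was not already in the vocab

# The new token is assigned the next free ID
print(tokenizer.convert_tokens_to_ids("<my_token>"))

# Note: if you use the tokenizer with a model, resize its embeddings afterwards:
# model.resize_token_embeddings(len(tokenizer))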
Hope this helps!