Tokenizer vocabulary

#28
by DjTobalito - opened

Hi,
I am using XLM-RoBERTa for multilingual classification with success, and I am trying to understand the tokenizer a bit better.
Naively, I expected short, common words in the languages present in the dataset to appear in the tokenizer.vocab dictionary.
But for French, for example, the word "oui" ("yes" in French) does not seem to be in the tokenizer.vocab dictionary.

Am I misunderstanding the tokenizer.vocab dictionary?
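
For reference, here is a minimal sketch of the kind of lookup described above. It uses a small made-up dictionary in place of the real tokenizer.vocab, so the entries, ids, and the `lookup` helper are all hypothetical. It assumes a SentencePiece-style vocabulary, which XLM-RoBERTa uses, where a piece that starts a word is stored with a leading "▁" (U+2581) marker. Under that assumption, a bare-string lookup can miss a word that is present in its marked form.

```python
# Toy stand-in for tokenizer.vocab; entries and ids are made up for illustration.
WORD_START = "\u2581"  # "▁", the SentencePiece word-boundary marker

toy_vocab = {
    "<s>": 0,
    "</s>": 2,
    WORD_START + "oui": 10,  # word-initial piece, stored with the marker
    WORD_START + "non": 11,
    "s": 12,                 # word-internal piece, stored without the marker
}


def lookup(word, vocab):
    """Return (bare_hit, marked_hit): is the word a key as-is, and with the marker?"""
    return word in vocab, (WORD_START + word) in vocab


bare_hit, marked_hit = lookup("oui", toy_vocab)
print(bare_hit, marked_hit)  # False True
```

With the real tokenizer, the same check would presumably go against `tokenizer.get_vocab()`, and `tokenizer.tokenize("oui")` would show which pieces the word is actually split into. Whether "▁oui" is a single entry in this model's vocabulary is something to confirm by running it.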
