Why doesn't 'microsoft/trocr-small-printed' have a vocab.json?

#3
by yaop - opened

Hello, thank you for your great work. I have a question: why doesn't 'microsoft/trocr-small-printed' have a vocab.json? Where is it?

Hey yaop, I had the same problem. After checking the issue I installed the SentencePiece library:

pip install sentencepiece

and the problem disappeared. I'm guessing the *sentencepiece.bpe.model* file serves as the vocabulary.

Microsoft org

It looks like the TrOCR authors used a different tokenizer for the small variants: a SentencePiece model (the sentencepiece.bpe.model file) instead of the Byte Pair Encoding tokenizer that ships a vocab.json.

Hence, you do need the SentencePiece library. You can load the tokenizer as follows:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> type(tokenizer)
<class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>
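If you want a clearer error when SentencePiece is missing, here is a minimal sketch of a guard around the same call. The helper names `module_available` and `load_trocr_small_tokenizer` are my own and not part of transformers; the only assumption is that the tokenizer needs the `sentencepiece` package, as described above.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def load_trocr_small_tokenizer(model_id: str = "microsoft/trocr-small-printed"):
    """Load the tokenizer, failing early with a hint if sentencepiece is absent.

    Hypothetical helper for illustration; it only wraps
    AutoTokenizer.from_pretrained.
    """
    if not module_available("sentencepiece"):
        raise ImportError(
            "This tokenizer needs SentencePiece. "
            "Install it with: pip install sentencepiece"
        )
    # Imported lazily so the check above runs even without transformers loaded.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)
```

Calling `load_trocr_small_tokenizer()` should then behave like the snippet above once `sentencepiece` is installed (it downloads the tokenizer files on first use).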

Thank you, it really helped me.

Thanks a lot, I will try it.