Why doesn't 'microsoft/trocr-small-printed' have vocab.json?

by yaop - opened Sep 13, 2022

yaop

Sep 13, 2022

Hello, and thank you for your great work. I have a question: why doesn't 'microsoft/trocr-small-printed' have a vocab.json? Where is it?

wmax

Sep 14, 2022

Hey yaop, I had the same problem. After checking the issue I installed the SentencePiece library:

pip install sentencepiece

and the problem disappeared. I'm guessing the *sentencepiece.bpe.model* file takes the place of vocab.json and holds the vocabulary.
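
In case it helps anyone else, here is a minimal sketch for checking whether the package is visible to your Python environment before loading the tokenizer. The helper names `has_sentencepiece` and `install_hint` are made up for illustration; the check uses only the standard library.

```python
import importlib.util


def has_sentencepiece() -> bool:
    """Return True if the `sentencepiece` package can be found in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec("sentencepiece") is not None


def install_hint() -> str:
    """Return a short message saying whether sentencepiece still needs installing."""
    if has_sentencepiece():
        return "sentencepiece is installed"
    return "sentencepiece is missing; run: pip install sentencepiece"


print(install_hint())
```

If it reports the package as missing right after installing, you are probably running a different interpreter or virtual environment than the one pip installed into.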

nielsr

Sep 14, 2022

It looks like the TrOCR authors used a different tokenization algorithm for the small variants (SentencePiece instead of Byte Pair Encoding).

Hence, you do need the SentencePiece library. You can load the tokenizer as follows:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> type(tokenizer)
<class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>
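
If you still want something like a vocab.json to look at, one option is to export the token-to-id mapping from the loaded tokenizer. This is only a sketch: `dump_vocab` and `export_trocr_small_vocab` are hypothetical helper names, and it assumes the tokenizer exposes `get_vocab()`, which the tokenizer classes in transformers do. The resulting file is for inspection only, not something the model repository ships or needs.

```python
import json


def dump_vocab(tokenizer, path="vocab.json"):
    """Write the tokenizer's token -> id mapping to a JSON file and return its size."""
    vocab = tokenizer.get_vocab()
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps SentencePiece markers such as "▁" readable.
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    return len(vocab)


def export_trocr_small_vocab(path="vocab.json"):
    """Download the tokenizer (needs network access and sentencepiece) and dump its vocab."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
    return dump_vocab(tokenizer, path)
```

Calling `export_trocr_small_vocab()` should write the file to the current directory and return the number of entries.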

yaop

Oct 14, 2022

Thank you, it really helped me.

yaop

Oct 14, 2022

Thanks a lot, I will try it.
