Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary

#13
by Seantaud - opened

I would like to construct a fast tokenizer class based on BioGptTokenizer, so that I can use offset_mapping to know which words the tokens originate from. Unfortunately, the conversion fails.
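For reference, this is what a fast tokenizer would let me do (a minimal sketch, assuming the conversion had succeeded and the backend had been saved to a hypothetical tokenizer.json; no such file ships with microsoft/biogpt):

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical: tokenizer.json would be the output of a successful
# slow-to-fast conversion.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

text = "BioGPT was pre-trained on PubMed abstracts."
enc = fast_tokenizer(text, return_offsets_mapping=True)

# Each token maps back to a (start, end) character span in the input,
# which the slow BioGptTokenizer cannot provide.
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(f"{token!r} -> {text[start:end]!r}")
```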

System Info
I was trying to use the BioGpt model in my code for fine-tuning. When creating a BioGptTokenizerFast from PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token -@</w> out of vocabulary.

Reproduction
I have copied the reproduction code into a Colab notebook. This is the link: https://colab.research.google.com/drive/1IMhiDz45GiarBLgXG9B2rA_u0ZOmmjJS?usp=sharing
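In case the notebook is inaccessible, the failing step boils down to something like this (a sketch, not the exact Colab code; the encoder and bpe_ranks attribute names come from the slow tokenizer's internals, and the </w> end-of-word suffix is my assumption about BioGPT's fastBPE format):

```python
from transformers import BioGptTokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE

slow = BioGptTokenizer.from_pretrained("microsoft/biogpt")

# Rebuild a fast BPE backend from the slow tokenizer's vocab and merges,
# roughly what convert_slow_tokenizer does internally.
vocab = slow.encoder            # token -> id
merges = list(slow.bpe_ranks)   # merge pairs, in rank order

# Raises:
#   Exception: Error while initializing BPE: Token `-@</w>` out of vocabulary
backend = Tokenizer(BPE(vocab, merges, end_of_word_suffix="</w>"))
```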

Expected behavior
According to issue https://github.com/huggingface/transformers/issues/9290, this problem might be caused by tokens missing from vocab.json or merges.txt. Could you please check? Thank you very much!
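To narrow it down, here is a small check I would run (a sketch, assuming merges.txt holds whitespace-separated pairs, possibly with a trailing frequency column): it lists every merge whose left part, right part, or concatenation is missing from vocab.json, which is exactly the condition the BPE initializer rejects.

```python
import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

missing = set()
with open("merges.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 2 or line.startswith("#"):
            continue  # skip headers and blank lines
        left, right = parts[0], parts[1]
        # BPE requires both halves of a merge and the merged token
        # itself to be present in the vocabulary.
        for token in (left, right, left + right):
            if token not in vocab:
                missing.add(token)

print(f"{len(missing)} tokens referenced by merges but absent from vocab:")
for token in sorted(missing):
    print(repr(token))
```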

Any update on this?
