Regarding tokenizer.json file

#2
by KamalG - opened

Hi,
I am trying to generate a single tokenizer.json or tokenizer.model file from the source.spm and target.spm files. I saw a tokenizer.json in this repo, but tokenization still fails for me without the .spm files. Could you please tell me how you generated the single tokenizer.json and how you use it for tokenization?

Hi. Thanks for the response. I went through the code you pointed to. I need to perform translation using the Helsinki-NLP/opus-mt-it-en model. In my work, the inference pipeline expects tokenizer.json and I can't tokenize using .spm files, so I am trying to tokenize using the tokenizer.json file. I generated tokenizer.json using your method.
For example, if I take any of your models from Xenova/opus-mt and pass the model path to MarianTokenizer or AutoTokenizer without the .spm files in that directory, loading from the directory fails.
Query: Is there some way to use a Hugging Face model and tokenize using tokenizer.json rather than the .spm files?
Please let me know if the question is clear.
Note: I understand that AutoTokenizer and MarianTokenizer are coded so that they expect .spm files. I am currently looking for a way to use tokenizer.json in some other way, without having to change AutoTokenizer or MarianTokenizer.
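
For reference, one direction that might work is to skip the Marian classes and read tokenizer.json with the standalone `tokenizers` package. This is only a rough sketch, not the repo author's method. It assumes `tokenizers` is installed and that the generated tokenizer.json is in the format that package reads. The helper names `find_tokenizer_json`, `load_tokenizer`, and `encode` are made up for illustration.

```python
from pathlib import Path


def find_tokenizer_json(model_dir):
    """Return the path to tokenizer.json inside model_dir, or raise if it is missing."""
    path = Path(model_dir) / "tokenizer.json"
    if not path.is_file():
        raise FileNotFoundError(f"no tokenizer.json in {model_dir}")
    return path


def load_tokenizer(model_dir):
    """Load tokenizer.json directly, without MarianTokenizer or the .spm files."""
    # Imported lazily so the path helper above works even if `tokenizers`
    # is not installed (assumption: the package is available at call time).
    from tokenizers import Tokenizer

    return Tokenizer.from_file(str(find_tokenizer_json(model_dir)))


def encode(tokenizer, text):
    """Return the token ids for a single sentence."""
    return tokenizer.encode(text).ids
```

With a local model directory (the path is hypothetical), usage would look like `tok = load_tokenizer("path/to/opus-mt-it-en")` followed by `ids = encode(tok, "Buongiorno")`. If the pipeline needs a transformers-style object, `PreTrainedTokenizerFast(tokenizer_file=...)` from `transformers` can wrap the same file. Whether the resulting ids and special tokens match what the model expects depends on how tokenizer.json was generated, so it seems worth comparing a few sentences against MarianTokenizer output with the .spm files present.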
