vocab.txt

#7
by jzhang86 - opened

Where can I find the vocab.txt for this multilingual model?

The vocabulary is based on SentencePiece rather than WordPiece, which BERT uses.

You can use the following code to print the vocab:

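from transformers import AutoTokenizer
# Load the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
# get_vocab() returns a dict mapping each token string to its integer id
print(tokenizer.get_vocab())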

@intfloat Thank you. So you are saying I can write vocab.txt from the value of tokenizer.vocab? I don't understand why the multilingual e5 models don't come with a vocab.txt the way the English e5 model does.
The reason I am asking is that I am trying to convert this model to ggml format using bert.cpp, which requires vocab.txt.
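
For reference, a minimal sketch of dumping a token-to-id mapping into a vocab.txt-style file, one token per line in id order. The helper name write_vocab_txt is hypothetical (it is not part of transformers), and it assumes the ids are contiguous from 0, so that the line number matches the token id. It only reproduces the file layout. The tokens are still SentencePiece pieces, so a WordPiece-based loader would most likely not tokenize text correctly with them.

```python
from typing import Dict


def write_vocab_txt(vocab: Dict[str, int], path: str) -> int:
    """Write tokens to `path`, one per line, ordered by token id.

    Hypothetical helper: assumes ids are contiguous from 0, so that
    line N of the file holds the token whose id is N.
    Returns the number of tokens written.
    """
    # Sort (token, id) pairs by id so the file order matches the id order
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    with open(path, "w", encoding="utf-8") as f:
        for token, _ in ordered:
            f.write(token + "\n")
    return len(ordered)
```

With the tokenizer loaded as in the snippet above, the call would look like write_vocab_txt(tokenizer.get_vocab(), "vocab.txt").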

As far as I know, only BERT-based models have a vocab.txt; models like T5 and XLM-RoBERTa do not have this file.

The multilingual e5 models are based on XLM-RoBERTa, not BERT.

I guess you should not try to run this model with a BERT codebase.

@intfloat This model supports 94 languages. How can I select only specific languages from the list? I need only 40 of them.
