
Cohere multilingual-22-12 tokenizer

This is the tokenizer for the Cohere multilingual-22-12 embedding model (see Cohere Multilingual Embeddings).

You can load it with the transformers library like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")
text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'])
print("Tokens:")
print(tokens)

number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)

Computing number of tokens

The following values can be used to approximate the number of tokens given the number of input characters:

approx_number_of_tokens = len(input_text) / ratio

E.g. for English, approx_number_of_tokens = len(input_text) / 4.8.

Language    Avg. characters per token
ar          3.6
de          4.6
en          4.8
es          4.6
fr          4.4
hi          3.8
it          4.5
ja          1.3
ko          2.0
zh          1.1

These values were computed on the first 10,000 paragraphs from Wikipedia; for other datasets, the ratios may differ.
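The ratios above can be wrapped in a small helper to estimate token counts without loading the tokenizer. This is only a sketch: the dictionary restates the table from this card, while the function name and default language are illustrative choices, not part of the Cohere API.

```python
# Approximate characters-per-token ratios, copied from the table above.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, language: str = "en") -> int:
    """Estimate the token count from the character count and the language ratio."""
    ratio = CHARS_PER_TOKEN[language]
    return round(len(input_text) / ratio)

# 37 characters of English at ~4.8 characters/token
print(approx_number_of_tokens("Hello World, this is my input string!"))  # → 8
```

Note that this is only a rough pre-flight estimate (e.g. for budgeting API calls); the exact count still requires running the tokenizer as shown earlier.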
