# Cohere multilingual-22-12 tokenizer
This is the tokenizer for the Cohere multilingual-22-12 embedding model (Cohere Multilingual Embeddings).

You can load it with the transformers library like this:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")

text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map token ids back to their token strings via the inverse vocabulary
inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
tokens = [inv_vocab[token_id] for token_id in enc['input_ids']]
print("Tokens:")
print(tokens)

number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)
```
## Computing the number of tokens
The following values can be used to approximate the number of tokens given the number of input characters:

`approx_number_of_tokens = len(input_text) / ratio`

E.g. for English, `approx_number_of_tokens = len(input_text) / 4.8`.
| Language | Avg. characters per token |
|---|---|
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |
These values have been computed on the first 10,000 paragraphs from Wikipedia. For other datasets, these values might differ.
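As a minimal sketch, the ratios from the table above can be wrapped in a small helper that estimates token counts without loading the tokenizer. The `RATIOS` dict values come from the table; the function name and rounding choice are my own, not part of the Cohere API.

```python
# Hypothetical helper: approximate token counts from character counts
# using the average characters-per-token ratios listed in the table.
RATIOS = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, lang: str = "en") -> int:
    """Estimate the token count of `input_text` for language `lang`."""
    return round(len(input_text) / RATIOS[lang])

# 37 characters of English at ~4.8 characters per token
print(approx_number_of_tokens("Hello World, this is my input string!", "en"))  # → 8
```

Note that this is only a rough heuristic for capacity planning; for exact counts, tokenize the text as shown in the snippet above.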