# Cohere multilingual-22-12 tokenizer
This is the tokenizer for the Cohere multilingual-22-12 embedding model (Cohere Multilingual Embeddings).
You can load it with the `transformers` library like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")

text = "Hellö World, this is my input string!"
enc = tokenizer(text)
print("Encoded input:")
print(enc)

# Map token IDs back to their string tokens
inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
tokens = [inv_vocab[token_id] for token_id in enc["input_ids"]]
print("Tokens:")
print(tokens)

number_of_tokens = len(enc["input_ids"])
print("Number of tokens:", number_of_tokens)
```
## Computing number of tokens

The following values can be used to approximate the number of tokens for a given number of input characters:

```python
approx_number_of_tokens = len(input_text) / ratio
```

E.g. for English: `approx_number_of_tokens = len(input_text) / 4.8`.
| Language | Avg. characters per token |
|---|---|
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |
These values were computed on the first 10,000 paragraphs from Wikipedia. For other datasets, these ratios might differ.
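As a quick illustration, the approximation can be wrapped in a small helper. This is just a sketch: the per-language ratios are copied from the table above, and the function name is hypothetical, not part of any Cohere API.

```python
# Average characters per token, copied from the table above (sketch, not an official API)
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, lang: str = "en") -> int:
    """Approximate the token count from the character count for a given language."""
    ratio = CHARS_PER_TOKEN[lang]
    return round(len(input_text) / ratio)
```

Note that this is only a rough estimate; for exact counts, tokenize the text with the tokenizer itself as shown above.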