---
license: apache-2.0
---

# Cohere `multilingual-22-12` tokenizer

This is the tokenizer for the Cohere `multilingual-22-12` embedding model: [Cohere Multilingual Embeddings](https://docs.cohere.ai/docs/multilingual-language-models)

You can load it with the transformers library like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/multilingual-22-12")
text = "Hellö World, this is my input string!"

enc = tokenizer(text)
print("Encoded input:")
print(enc)

inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
tokens = [inv_vocab[token_id] for token_id in enc['input_ids']]
print("Tokens:")
print(tokens)

number_of_tokens = len(enc['input_ids'])
print("Number of tokens:", number_of_tokens)
```

## Computing the number of tokens

The following formula can be used to approximate the number of tokens from the number of input characters:

```
approx_number_of_tokens = len(input_text) / ratio
```

For example, for English: `approx_number_of_tokens = len(input_text) / 4.8`.

| Language | Avg. characters per token |
| --- | :---: |
| ar | 3.6 |
| de | 4.6 |
| en | 4.8 |
| es | 4.6 |
| fr | 4.4 |
| hi | 3.8 |
| it | 4.5 |
| ja | 1.3 |
| ko | 2.0 |
| zh | 1.1 |

These values were computed on the first 10,000 paragraphs from [Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12). For other datasets, these values may differ.
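
The approximation above can be wrapped in a small helper. This is an illustrative sketch, not part of the Cohere API: the `CHARS_PER_TOKEN` dictionary simply restates the per-language ratios from the table, and the function name is our own.

```python
# Hypothetical helper (not part of the Cohere API): estimates the token
# count from character length using the per-language ratios listed above.
CHARS_PER_TOKEN = {
    "ar": 3.6, "de": 4.6, "en": 4.8, "es": 4.6, "fr": 4.4,
    "hi": 3.8, "it": 4.5, "ja": 1.3, "ko": 2.0, "zh": 1.1,
}

def approx_number_of_tokens(input_text: str, language: str = "en") -> float:
    """Approximate the number of tokens for `input_text` in `language`."""
    ratio = CHARS_PER_TOKEN[language]
    return len(input_text) / ratio

print(approx_number_of_tokens("Hellö World, this is my input string!"))
```

For an exact count, tokenize the text as shown in the loading example and take `len(enc['input_ids'])`; the ratio-based estimate is only useful when running the tokenizer is impractical.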