This is the tokenizer used by the Sabiá-2 Medium model.

Sabiá-2 Medium is a proprietary LLM available in two forms: through an API endpoint, which we refer to as the "MariTalk API", or as an encrypted downloadable version that runs locally, known as "MariTalk Local".

This tokenizer is provided so that you can estimate the number of tokens in your prompts and, therefore, the cost of using the model.

```python
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("maritaca-ai/sabia-2-tokenizer-medium")

prompt = "Com quantos paus se faz uma canoa?"

tokens = tokenizer.encode(prompt)

print(f'O prompt "{prompt}" contém {len(tokens)} tokens.')  # It should print 12 tokens.
```
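
The token count from the example above can be turned into a rough cost estimate. The following is a minimal sketch, not an official pricing calculator: the helper name `estimate_prompt_cost` and the `price_per_million_tokens` parameter are hypothetical, and the actual price must be taken from the current MariTalk pricing. It covers only the prompt tokens, so any tokens generated by the model would have to be accounted for separately.

```python
def estimate_prompt_cost(tokenizer, prompt: str, price_per_million_tokens: float) -> tuple[int, float]:
    """Return (token_count, estimated_cost) for a single prompt.

    `tokenizer` is any object with an `encode(text)` method that returns a
    sequence of token ids, such as the tokenizer loaded with AutoTokenizer.
    `price_per_million_tokens` is an assumed input supplied by the caller;
    it is not a real price.
    """
    # Count tokens exactly as in the example above.
    token_count = len(tokenizer.encode(prompt))
    # Scale the per-million price down to the number of tokens in the prompt.
    estimated_cost = token_count * price_per_million_tokens / 1_000_000
    return token_count, estimated_cost
```

To use it, pass the `tokenizer` and `prompt` from the previous example together with the per-million-token price that applies to your account. The result is in whatever currency the price is expressed in.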

For more information on how to use the model, please refer to the [MariTalk API documentation](https://maritaca-ai.github.io/maritalk-api/maritalk.html).