---
language: pt
tags:
- legal
license: cc-by-sa-4.0
---

# LegalBERT Tokenizer

The **LegalBERT** tokenizer is a word-level byte-pair encoding tokenizer with a vocabulary of 52k tokens (covering the most common words in legal documents), based on the [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) tokenizer. It was trained on data provided by the **BRAZILIAN SUPREME FEDERAL TRIBUNAL**, under the terms of use described in [LREC 2020](https://ailab.unb.br/victor/lrec2020).
The tokenizer uses the `BertTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).

**NOTE**: The results of this project do not in any way represent the position of the BRAZILIAN SUPREME FEDERAL TRIBUNAL; they are the sole and exclusive responsibility of the author.

## Tokenizer usage

```python
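from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-tokenizer")

# Any Portuguese legal text works here; this excerpt comes from the comparison below
example = "De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada"
tokens = tokenizer.tokenize(example)
print(tokens)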
```

### Comparison of results

**Original Text**: ```De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada, ou quem as suas vezes fizer, do inteiro teor do(a) despacho/decisão presente nos autos (art. 270 do Código de Processo Cívil e art 5º da Lei 11.419/2006).```

| Tokenizer | Tokens | Num. Tokens |
| --------- | ------ | ----------- |
| BERTimbau | ```['De', 'ordem', ',', 'a', 'Secretaria', 'Judic', '##iária', 'do', 'Supremo', 'Tribunal', 'Federal', 'IN', '##TI', '##MA', 'a', 'parte', 'abaixo', 'identificada', ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fiz', '##er', ',', 'do', 'inteiro', 'teor', 'do', '(', 'a', ')', 'despa', '##cho', '/', 'decisão', 'presente', 'nos', 'auto', '##s', '(', 'art', '.', '27', '##0', 'do', 'Código', 'de', 'Processo', 'Cí', '##vil', 'e', 'art', '[UNK]', 'da', 'Lei', '11', '.', '41', '##9', '/', '2006', ')', '.']``` | 66 |
| LegalBERT | ```['De', 'ordem', ',', 'a', 'Secretaria', 'Judiciária', 'do', 'Supremo', 'Tribunal', 'Federal', 'INTIMA', 'a', 'parte', 'abaixo', 'identificada', ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fizer', ',', 'do', 'inteiro', 'teor', 'do', '(', 'a', ')', 'despacho', '/', 'decisão', 'presente', 'nos', 'autos', '(', 'art', '.', '270', 'do', 'Código', 'de', 'Processo', 'Cív', '##il', 'e', 'art', '5º', 'da', 'Lei', '11', '.', '419', '/', '2006', ')', '.']``` | 58 |

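The difference between the two rows can be quantified beyond the raw token count. The sketch below is only an illustration: `summarize_tokens` is a hypothetical helper (not part of this repository or of `transformers`), and the token lists are copied verbatim from the table above rather than recomputed. It counts the total tokens, the `##` continuation pieces that BERT-style tokenizers emit when a word is split, and the `[UNK]` tokens.

```python
from typing import Dict, List


def summarize_tokens(tokens: List[str]) -> Dict[str, int]:
    """Hypothetical helper: basic fragmentation statistics for a token list."""
    return {
        "num_tokens": len(tokens),
        # '##' marks a piece that continues the previous token
        "continuation_pieces": sum(t.startswith("##") for t in tokens),
        "unknown_tokens": tokens.count("[UNK]"),
    }


# Token lists copied from the comparison table (not recomputed here).
bertimbau_tokens = [
    'De', 'ordem', ',', 'a', 'Secretaria', 'Judic', '##iária', 'do', 'Supremo',
    'Tribunal', 'Federal', 'IN', '##TI', '##MA', 'a', 'parte', 'abaixo',
    'identificada', ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fiz', '##er',
    ',', 'do', 'inteiro', 'teor', 'do', '(', 'a', ')', 'despa', '##cho', '/',
    'decisão', 'presente', 'nos', 'auto', '##s', '(', 'art', '.', '27', '##0',
    'do', 'Código', 'de', 'Processo', 'Cí', '##vil', 'e', 'art', '[UNK]', 'da',
    'Lei', '11', '.', '41', '##9', '/', '2006', ')', '.',
]
legalbert_tokens = [
    'De', 'ordem', ',', 'a', 'Secretaria', 'Judiciária', 'do', 'Supremo',
    'Tribunal', 'Federal', 'INTIMA', 'a', 'parte', 'abaixo', 'identificada',
    ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fizer', ',', 'do', 'inteiro',
    'teor', 'do', '(', 'a', ')', 'despacho', '/', 'decisão', 'presente', 'nos',
    'autos', '(', 'art', '.', '270', 'do', 'Código', 'de', 'Processo', 'Cív',
    '##il', 'e', 'art', '5º', 'da', 'Lei', '11', '.', '419', '/', '2006', ')',
    '.',
]

bertimbau_stats = summarize_tokens(bertimbau_tokens)
legalbert_stats = summarize_tokens(legalbert_tokens)

for name, stats in [("BERTimbau", bertimbau_stats), ("LegalBERT", legalbert_stats)]:
    print(name, stats)
```

To run the same summary on other text, the hard-coded lists could be replaced with the output of each tokenizer's `tokenize` method.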

## Citation

If you use this tokenizer, please cite:
```
@misc {maicon_domingues_2022,
	author       = { {Maicon Domingues} },
	title        = { legal-bert-tokenizer (Revision d8e9d4a) },
	year         = 2022,
	url          = { https://huggingface.co/dominguesm/legal-bert-tokenizer },
	doi          = { 10.57967/hf/0110 },
	publisher    = { Hugging Face }
}
```

## Contact

* <a href="mailto:dominguesm@outlook.com">dominguesm@outlook.com</a>
* [NLP.ROCKS](http://nlp.rocks)