dominguesm/legal-bert-tokenizer

dominguesm committed on Nov 15, 2022

Commit c213c3d • 1 Parent(s): d8e9d4a

Create README.md

Files changed (1)

README.md ADDED +52 -0

	@@ -0,0 +1,52 @@

+---
+language: pt
+license: cc-by-sa-4.0
+---
+# LegalBERT Tokenizer
+The **LegalBERT** tokenizer is a word-level byte-pair encoding tokenizer with a
+vocabulary of 52k tokens (covering the most common words in legal documents), based on the [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) tokenizer. It was trained on data provided by the **BRAZILIAN SUPREME FEDERAL TRIBUNAL**, under the terms of use described in [LREC 2020](https://ailab.unb.br/victor/lrec2020).
+The tokenizer uses the `BertTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
+**NOTE**: The results of this project do not in any way represent the position of the BRAZILIAN SUPREME FEDERAL TRIBUNAL; they are the sole and exclusive responsibility of the author.
+## Tokenizer usage
+```python
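+from transformers import AutoTokenizer
+# Load the tokenizer from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-tokenizer")
+# Any Portuguese legal text works here; this excerpt comes from the comparison below
+example = "De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada"
+# Split the text into subword tokens (returns a list of strings)
+tokens = tokenizer.tokenize(example)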
+```
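With the `BertTokenizer` implementation, `tokenize` returns a list of string pieces, and a piece that continues the previous word carries a `##` prefix. The sketch below regroups such pieces into whole words. The helper name `merge_wordpieces` is hypothetical (it is not part of transformers), and the input list is written by hand rather than produced by a live tokenizer call.

```python
from typing import List


def merge_wordpieces(pieces: List[str]) -> List[str]:
    """Join '##'-prefixed continuation pieces onto the preceding piece.

    Hypothetical helper for illustration; not part of transformers.
    """
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Continuation piece: drop the marker and append to the last word.
            words[-1] += piece[2:]
        else:
            # A piece without the marker starts a new word.
            words.append(piece)
    return words


# Hand-written pieces in the style of a BERT-type tokenizer's output.
pieces = ["Secretaria", "Judic", "##iária", "do", "Supremo"]
print(merge_wordpieces(pieces))
```

Regrouping like this is only a convenience for inspecting how a word was split; it does not restore original spacing or punctuation placement.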
+### Comparison of results
+Original Text: ```De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada, ou quem as suas vezes fizer, do inteiro teor do(a) despacho/decisão presente nos autos (art. 270 do Código de Processo Cívil e art 5º da Lei 11.419/2006).```
+| Tokenizer | Tokens | Num. Tokens |
+| --------- | ------ | ----------- |
+| BERTimbau | ```['De', 'ordem', ',', 'a', 'Secretaria', 'Judic', '##iária', 'do', 'Supremo', 'Tribunal', 'Federal', 'IN', '##TI', '##MA', 'a', 'parte', 'abaixo', 'identificada', ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fiz', '##er', ',', 'do', 'inteiro', 'teor', 'do', '(', 'a', ')', 'despa', '##cho', '/', 'decisão', 'presente', 'nos', 'auto', '##s', '(', 'art', '.', '27', '##0', 'do', 'Código', 'de', 'Processo', 'Cí', '##vil', 'e', 'art', '[UNK]', 'da', 'Lei', '11', '.', '41', '##9', '/', '2006', ')', '.']``` | 66 |
+| LegalBERT | ```['De', 'ordem', ',', 'a', 'Secretaria', 'Judiciária', 'do', 'Supremo', 'Tribunal', 'Federal', 'INTIMA', 'a', 'parte', 'abaixo', 'identificada', ',', 'ou', 'quem', 'as', 'suas', 'vezes', 'fizer', ',', 'do', 'inteiro', 'teor', 'do', '(', 'a', ')', 'despacho', '/', 'decisão', 'presente', 'nos', 'autos', '(', 'art', '.', '270', 'do', 'Código', 'de', 'Processo', 'Cív', '##il', 'e', 'art', '5º', 'da', 'Lei', '11', '.', '419', '/', '2006', ')', '.']``` | 58 |
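The figures in the table can be checked directly from the two token lists. The sketch below copies both lists from the table and derives, for each, the total number of tokens, the number of `##` continuation pieces, and the number of `[UNK]` tokens. The names `summarize`, `bertimbau`, and `legalbert` are illustrative only, and no tokenizer is loaded.

```python
from typing import List, Tuple

# Token lists copied from the comparison table.
bertimbau = ['De', 'ordem', ',', 'a', 'Secretaria', 'Judic', '##iária', 'do',
             'Supremo', 'Tribunal', 'Federal', 'IN', '##TI', '##MA', 'a',
             'parte', 'abaixo', 'identificada', ',', 'ou', 'quem', 'as',
             'suas', 'vezes', 'fiz', '##er', ',', 'do', 'inteiro', 'teor',
             'do', '(', 'a', ')', 'despa', '##cho', '/', 'decisão',
             'presente', 'nos', 'auto', '##s', '(', 'art', '.', '27', '##0',
             'do', 'Código', 'de', 'Processo', 'Cí', '##vil', 'e', 'art',
             '[UNK]', 'da', 'Lei', '11', '.', '41', '##9', '/', '2006', ')',
             '.']
legalbert = ['De', 'ordem', ',', 'a', 'Secretaria', 'Judiciária', 'do',
             'Supremo', 'Tribunal', 'Federal', 'INTIMA', 'a', 'parte',
             'abaixo', 'identificada', ',', 'ou', 'quem', 'as', 'suas',
             'vezes', 'fizer', ',', 'do', 'inteiro', 'teor', 'do', '(', 'a',
             ')', 'despacho', '/', 'decisão', 'presente', 'nos', 'autos',
             '(', 'art', '.', '270', 'do', 'Código', 'de', 'Processo', 'Cív',
             '##il', 'e', 'art', '5º', 'da', 'Lei', '11', '.', '419', '/',
             '2006', ')', '.']


def summarize(tokens: List[str]) -> Tuple[int, int, int]:
    """Return (total tokens, '##' continuation pieces, '[UNK]' tokens)."""
    continuations = sum(1 for t in tokens if t.startswith("##"))
    unknowns = tokens.count("[UNK]")
    return len(tokens), continuations, unknowns


for name, tokens in (("BERTimbau", bertimbau), ("LegalBERT", legalbert)):
    total, continuations, unknowns = summarize(tokens)
    print(f"{name}: {total} tokens, {continuations} continuation pieces, "
          f"{unknowns} unknown")

# Relative reduction in sequence length for this one sentence.
reduction = 100 * (len(bertimbau) - len(legalbert)) / len(bertimbau)
print(f"Reduction: {reduction:.1f}%")
```

For this sentence the lists confirm the 66 and 58 totals shown in the table. The reduction is measured on a single example, so it should not be read as a general figure.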
+## Citation
+If you use this tokenizer, please cite:
+```
+@misc {maicon_domingues_2022,
+	author       = { {Maicon Domingues} },
+	title        = { legal-bert-tokenizer (Revision d8e9d4a) },
+	year         = 2022,
+	url          = { https://huggingface.co/dominguesm/legal-bert-tokenizer },
+	doi          = { 10.57967/hf/0110 },
+	publisher    = { Hugging Face }
+}
+```
+## Contact
+* <a href="mailto:dominguesm@outlook.com">dominguesm@outlook.com</a>
+* [NLP.ROCKS](http://nlp.rocks)