# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

### The Natural Language Processing library for the Brazilian legal language, *LegalNLP*, was born in a partnership between Brazilian researchers and the legal tech [Tikal Tech](https://www.tikal.tech), based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that facilitate the manipulation of legal texts in Portuguese, as well as demonstrations/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:

    @article{polo2021legalnlp,
      title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
      author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
      journal={arXiv preprint arXiv:2110.15709},
      year={2021}
    }

--------------

2. [ Word2Vec/Doc2Vec ](#3.2)
3. [ FastText ](#3.3)
4. [ BERTikal ](#3.4)
4. [ Demonstrations / Tutorials ](#4)
5. [ References ](#5)

--------------

<a name="0"></a>
## 0\. Accessing the Language Models

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

Some models can be downloaded directly using our function `get_premodel` (more details in Section [Other Functions](#2.2)).

Please contact *felipemaiapolo@gmail.com* if you have any problems accessing the language models.

--------------

You can install *LegalNLP* with `pip install git+https://github.com/felipemaiapolo/legalnlp` and load all our functions by running the following commands:

```python
from legalnlp.clean_functions import *
from legalnlp.get_premodel import *
```

Function to download a pre-trained model into the same folder as the file that is being executed.

- **model = "wdoc"**: Downloads the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzips it. It contains two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and the other with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
- **model = "fasttext"**: Downloads a .zip file containing the size-100 FastText CBOW/SG models and unzips it.
- **model = "phraser"**: Downloads the Phraser pre-trained models in a .zip file and unzips it. It contains two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
- **model = "w2vnilc"**: Downloads a size-100 Word2Vec CBOW embeddings model trained by the "Núcleo Interinstitucional de Linguística Computacional - USP" in a .zip file and unzips it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
- **model = "neuralmindbase"**: Downloads a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about NeuralMind's BERT models, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
- **model = "neuralmindlarge"**: Downloads a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about NeuralMind's BERT models, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
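
Every `model` keyword above follows the same download-and-unzip pattern. A minimal sketch of that pattern using only the standard library (the helper names `unzip_bytes` and `fetch_and_unzip` are ours for illustration, not part of LegalNLP's API):

```python
import io
import urllib.request
import zipfile


def unzip_bytes(payload, dest="."):
    """Extract an in-memory .zip archive into `dest`; return its file names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def fetch_and_unzip(url, dest="."):
    """Download a .zip archive and extract it next to the running script,
    mirroring the behavior described for get_premodel."""
    with urllib.request.urlopen(url) as resp:
        return unzip_bytes(resp.read(), dest)
```

`get_premodel` additionally maps each keyword to its download URL; here the URL has to be passed explicitly.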
127 |
|
128 |
#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`
|
129 |
|
130 |
+
Function for extracting features with the BERT model (This function is not accessed through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb)).
|
131 |
|
132 |
|
133 |
**Input:**

Below we have a summary table with some important information about the trained models:

| Filenames | FastText | Sizes | Windows |
|:-------------------:|:--------------:|:--------------:|:--------------:|
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |
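
Once a CBOW or SG model of a given size is loaded, a common way to turn a whole document into a single feature vector for a downstream classifier is to average its word vectors. A generic sketch (here `wv` stands for any token-to-vector mapping, such as loaded Word2Vec/FastText vectors, and `doc_vector` is our name, not part of LegalNLP):

```python
import numpy as np


def doc_vector(tokens, wv, size=100):
    """Average the vectors of the tokens found in the vocabulary;
    return a zero vector when no token is known."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(size)
```

Out-of-vocabulary tokens are simply skipped, which is one reason FastText (which builds vectors from character n-grams) can be preferable for noisy legal text.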

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

- **BERT notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
- **Word2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)
- **Doc2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)