# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

### *LegalNLP*, a Natural Language Processing library for the Brazilian legal language, was born from a partnership between Brazilian researchers and the legal tech [Tikal Tech](https://www.tikal.tech), based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that facilitate the manipulation of legal texts in Portuguese, as well as demonstrations/tutorials to help people in their own work.
 
You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).

If you use our library in your academic work, please cite us in the following way:
 
    @article{polo2021legalnlp,
      title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
      author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
      journal={arXiv preprint arXiv:2110.15709},
      year={2021}
    }
 
--------------

   2. [ Word2Vec/Doc2Vec ](#3.2)
   3. [ FastText ](#3.3)
   4. [ BERTikal ](#3.4)
4. [ Demonstrations / Tutorials](#4)
5. [ References](#5)
 
--------------

<a name="0"></a>
## 0\. Accessing the Language Models
 

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

Some models can be downloaded directly using our function `get_premodel` (more details in Section [Other Functions](#2.2)).
 

Please contact *felipemaiapolo@gmail.com* if you have any problems accessing the language models.
 
--------------

You can install ***LegalNLP*** by running

    $ pip install git+https://github.com/felipemaiapolo/legalnlp
 
You can load all our functions by running the following command:

```python
from legalnlp.clean_functions import *
from legalnlp.get_premodel import *
```
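As a quick check that the installation worked, the sketch below downloads a model and cleans a sample string. The `clean` function and its `lower_case` argument come from the library's text-cleaning module, but treat the exact signature as an assumption here:

```python
from legalnlp.clean_functions import *
from legalnlp.get_premodel import *

# Download the Phraser models into the current working directory.
get_premodel("phraser")

# Normalize a legal text snippet (assumed signature: clean(text, lower_case=True)).
print(clean("EMENTA: AGRAVO REGIMENTAL NO RECURSO ESPECIAL.", lower_case=True))
```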

`get_premodel(model)` downloads a pre-trained model into the same folder as the file that is running. The available options are listed below, with a usage sketch after the list:
  - **model = "wdoc"**: Download Word2Vec and Do2vec pre-trained models in a.zip file and unzip it. It has 2 two files, one with an size 100 Doc2Vec Distributed Memory/ Word2Vec Continuous Bag-of-Words (CBOW) embeddings model and other with an size 100 Doc2Vec Distributed Bag-of-Words (DBOW)/ Word2Vec Skip-Gram (SG) embeddings model.
116
  - **model = "fasttext"**: Download a .zip file containing 100 sized FastText CBOW/SG models and unzip it.
117
  - **model = "phraser"**: Download Phraser pre-trained model in a .zip file and unzip it. It has 2 two files with phraser1 and phreaser2. We explain how to use them in Section [ Phraser ](#3.1).
118
+ - **model = "w2vnilc"**: Download size 100 Word2Vec CBOW model trained by "Núcleo Interinstitucional de Linguística Computacional - USP" embeddings model in a .zip file and unzip it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
119
  - **model = "neuralmindbase"**: Download a .zip file containing base BERT model (PyTorch), trained by NeuralMind and unzip it. For more informations about BERT models made by NeuralMind go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
120
  - **model = "neuralmindlarge"**: Download a .zip file containing large BERT model (PyTorch), trained by NeuralMind and unzip it. For more informations about BERT models made by NeuralMind go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
121
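A minimal usage sketch for the options above (each call saves and unzips the files next to the running script):

```python
from legalnlp.get_premodel import *

# Word2Vec/Doc2Vec models (size-100 CBOW/DM and SG/DBOW pairs).
get_premodel("wdoc")

# Base BERT model (PyTorch) trained by NeuralMind.
get_premodel("neuralmindbase")
```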
 
 
#### 2.2.1\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

Function for extracting features with the BERT model. (This function is not accessed through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb).)
 
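A rough sketch of a call, assuming the function has been copied from that notebook into your session (the paths and the pandas input format are assumptions):

```python
import pandas as pd

# Hypothetical paths: wherever the BERT model and tokenizer were unzipped.
feats = extract_features_bert(path_model='model_bertikal/',
                              path_tokenizer='model_bertikal/',
                              data=pd.Series(["Ação de indenização por danos morais."]),
                              gpu=False)
```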
 
**Input:**

Below we have a summary table with some important information about the trained models:

| Filenames | FastText | Sizes | Windows |
|:-------------------:|:------------------------------:|:--------------:|:--------------:|
| ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |
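A minimal sketch of loading one of these models with `gensim` (the filename is an assumption based on the table above; point it at wherever `get_premodel("fasttext")` unzipped the files):

```python
from gensim.models import FastText

# Load the size-100 Skip-Gram model (assumed filename).
ft = FastText.load("fasttext_sg_100.model")

# FastText builds vectors from character n-grams, so even out-of-vocabulary
# legal terms get an embedding.
vec = ft.wv["usucapião"]
print(vec.shape)  # (100,)
```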
 
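BERTikal itself is loaded with Hugging Face `transformers`. A minimal sketch, assuming the checkpoint was unzipped to `model_bertikal/`; the tokenizer line follows the standard `transformers` pattern and is an assumption here:

```python
from transformers import BertModel, BertTokenizer

# Path where the BERTikal files were unzipped (assumed).
bert_tokenizer = BertTokenizer.from_pretrained('model_bertikal/', do_lower_case=False)
bert_model = BertModel.from_pretrained('model_bertikal/')
```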
 
For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models, such as Logistic Regression and CatBoost:

- **BERT notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
- **Word2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)
- **Doc2Vec notebook**: click [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)