Projeto committed on
Commit 155b2c4
1 Parent(s): 0b085e7

Update README.md

Files changed (1)
  1. README.md +22 -262
README.md CHANGED
@@ -6,6 +6,8 @@ tags:
  - NLP
  - legal field
  - python
  ---
 
 
@@ -30,16 +32,10 @@ If you use our library in your academic work, please cite us in the following way
  0. [Accessing the Language Models](#0)
  1. [ Introduction / Installing package](#1)
- 2. [ Functions ](#2)
-     1. [ Text Cleaning Functions](#2.1)
-     2. [ Other Functions](#2.2)
- 3. [ Language Models (Details / How to use)](#3)
-     1. [ Phraser ](#3.1)
-     2. [ Word2Vec/Doc2Vec ](#3.2)
-     3. [ FastText ](#3.3)
-     4. [ BERTikal ](#3.4)
- 4. [ Demonstrations / Tutorials](#4)
- 5. [ References](#5)
 
  --------------
 
@@ -49,9 +45,6 @@ If you use our library in your academic work, please cite us in the following way
 
  All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).
 
- Some models can be downloaded directly using our function `get_premodel` (more details in section [Other Functions](#2.2)).
-
-
  Please contact *felipemaiapolo@gmail.com* if you have any problem accessing the language models.
 
  --------------
@@ -61,159 +54,30 @@ Please contact *felipemaiapolo@gmail.com* if you have any problem accessing the language models.
  *LegalNLP* is promising given the scarcity of Natural Language Processing resources focused on the Brazilian legal language. It is worth mentioning that our library was made for Python, one of the most well-known programming languages for machine learning.
 
 
- You can install our package by running the following command on the terminal:
  ```sh
- $ pip install git+https://github.com/felipemaiapolo/legalnlp
- ```
-
- You can load all our functions by running:
-
- ```python
- from legalnlp.clean_functions import *
- from legalnlp.get_premodel import *
  ```
 
-
- --------------
-
- <a name="2"></a>
- ## 2\. Functions
- <a name="2.1"></a>
- ### 2.1\. Text Cleaning Functions
-
-
- <a name="2.1.1"></a>
- #### 2.1.1\. `clean(text, lower=True, return_masked=False)`
- Function for cleaning texts to be used (optionally) in conjunction with the Doc2Vec, Word2Vec, and FastText models. We use RegEx to mask/extract information such as email addresses, URLs, dates, numbers, monetary values, etc.
-
- **Input:**
-
- - *text*, **str**;
-
- - *lower*, **bool**, default=**True**. If lower==True, the function lowercases the whole text. Note that all the models (except BERT) were trained on lowercased texts;
-
- - *return_masked*, **bool**, default=**False**. If return_masked == False, the function outputs a clean text. Otherwise, it returns a dictionary containing the clean text and the information extracted by RegEx;
-
- **Output:**
-
- - Clean text or dictionary, depending on the *return_masked* parameter;
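The exact regular expressions inside `clean` are internal to the package; the toy sketch below only illustrates the masking idea, and its patterns, mask tokens, and dictionary layout are assumptions, not the library's actual behavior:

```python
import re

# Illustrative patterns only -- the real `clean` uses the package's own regexes.
PATTERNS = {
    "[email]": re.compile(r"\S+@\S+\.\S+"),
    "[url]": re.compile(r"https?://\S+"),
    "[number]": re.compile(r"\b\d+(?:[.,]\d+)*\b"),
}

def toy_clean(text, lower=True, return_masked=False):
    masked = {}
    for mask, pattern in PATTERNS.items():
        masked[mask] = pattern.findall(text)  # keep what was extracted
        text = pattern.sub(mask, text)        # replace it with a mask token
    if lower:
        text = text.lower()
    return {"clean_text": text, "masked": masked} if return_masked else text

print(toy_clean("Contato: juiz@tjsp.jus.br, processo 1234"))
# 'contato: [email] processo [number]'
```

With `return_masked=True` the same call would also hand back the extracted strings, mirroring the dictionary-vs-string switch described above.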
-
-
- <a name="2.1.2"></a>
- #### 2.1.2\. `clean_bert(text)`
-
- Function for cleaning texts to be used (optionally) in conjunction with the BERT model.
-
- **Input:**
-
- - *text*, **str**.
-
- **Output:**
-
- - **str** with the clean text.
-
- <a name="2.2"></a>
- ### 2.2\. Other functions
-
- #### 2.2.1\. `get_premodel(model)`
-
- Function to download a pre-trained model into the same folder as the file being executed.
-
- **Input:**
-
- - *model*, **str**. Must contain the name of the pre-trained model to use. The options are:
-     - **model = "bert"**: Downloads a .zip file containing the BERTikal model and unzips it.
-     - **model = "wdoc"**: Downloads the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzips it. It has two files: one with a size-100 Doc2Vec Distributed Memory / Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and another with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW) / Word2Vec Skip-Gram (SG) embeddings model.
-     - **model = "fasttext"**: Downloads a .zip file containing the size-100 FastText CBOW/SG models and unzips it.
-     - **model = "phraser"**: Downloads the Phraser pre-trained models in a .zip file and unzips it. It has two files, phraser1 and phraser2. We explain how to use them in Section [ Phraser ](#3.1).
-     - **model = "w2vnilc"**: Downloads the size-100 Word2Vec CBOW embeddings model trained by "Núcleo Interinstitucional de Linguística Computacional - USP" in a .zip file and unzips it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
-     - **model = "neuralmindbase"**: Downloads a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
-     - **model = "neuralmindlarge"**: Downloads a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
-
- **Output:**
-
- - True if the download of some model was made and False otherwise.
-
-
- #### 2.2.2\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`
-
- Function for extracting features with the BERT model (this function is not accessed through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb)).
-
- **Input:**
-
- - *path_model*, **str**. Must contain the path of the pre-trained model;
-
- - *path_tokenizer*, **str**. Must contain the path of the tokenizer;
-
- - *data*, **list**. Must contain a list of texts from which features will be extracted;
-
- - *gpu*, **bool**, default=**True**. If gpu==False, the GPU will not be used in the model application (we recommend doing feature extraction on Google Colab).
-
- **Output:**
-
- - **DataFrame** with the features extracted by the BERT model.
-
-
- <a name="3"></a>
- ## 3\. Language Models
-
- <a name="3.1"></a>
- ### 3.1\. Phraser
-
- Phraser is a statistical method proposed in the natural language processing literature [1] for identifying which words, when they appear together, can be considered unique tokens. Applying this method, we can identify how relevant the occurrence of a bigram is compared to the occurrences of the words that make it up separately. Thus, we can identify that a bigram like "São Paulo" should be treated as a single token, for example. If the method is applied a second time in sequence, we can check which trigrams and quadrigrams are relevant. Since the two applications should be done with different Phraser models, it can be the case that the second application identifies bigrams that were not identified by the first model.
-
- This model is compatible with the `clean` function, but it is not necessary to use it beforehand. Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/phrases.html) for more details. Preferably use Gensim version 3.8.3.
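The statistic behind Phraser is, roughly, the collocation score from [1]: score(a, b) = (count(ab) − δ) · |V| / (count(a) · count(b)), where a bigram is merged into one token when its score clears a threshold. A self-contained toy version (the counts, threshold, and δ below are made-up illustrations, not values from our trained models):

```python
from collections import Counter

def score_bigram(a, b, unigrams, bigrams, vocab_size, delta=1):
    # Collocation score: large when "a b" co-occurs more often than the
    # individual frequencies of a and b would suggest.
    if unigrams[a] == 0 or unigrams[b] == 0:
        return 0.0
    return (bigrams[(a, b)] - delta) * vocab_size / (unigrams[a] * unigrams[b])

def phrase(tokens, unigrams, bigrams, vocab_size, threshold=10.0):
    # Greedy left-to-right pass: merge a pair into "a_b" when its score
    # clears the threshold, otherwise keep the token as-is.
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens) and
                score_bigram(tokens[i], tokens[i + 1], unigrams, bigrams,
                             vocab_size) > threshold):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Made-up corpus statistics for illustration
unigrams = Counter({"sao": 50, "paulo": 40, "direito": 300, "do": 1000})
bigrams = Counter({("sao", "paulo"): 38, ("direito", "do"): 20})

print(phrase(["direito", "do", "sao", "paulo"], unigrams, bigrams, 10000))
# ['direito', 'do', 'sao_paulo']
```

Running a second, separately fitted pass over the merged output is what lets trigrams and quadrigrams emerge, as described above.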
-
- #### Using *Phraser*
- Installing Gensim:
-
 
  ```python
- !pip install gensim=='3.8.3'
  ```
 
- Importing the package and loading our two Phraser models:
-
 
  ```python
- #Importing packages
- from gensim.models.phrases import Phraser
-
- #Loading two Phraser models
- phraser1=Phraser.load('models_phraser/phraser1')
- phraser2=Phraser.load('models_phraser/phraser2')
  ```
 
 
- Applying Phraser once and twice to check the output:
-
-
- ```python
- txt='direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios'
- tokens=txt.split()
 
- print('Clean Text: "'+' '.join(tokens)+'"')
- print('\nApplying Phraser 1x: "'+' '.join(phraser1[tokens])+'"')
- print('\nApplying Phraser 2x: "'+' '.join(phraser2[phraser1[tokens]])+'"')
- ```
 
- Clean Text: "direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios"
-
- Applying Phraser 1x: "direito do consumidor origem : bangu regional xxix juizado_especial civel_ação : [processo] - - recte : fundo de investimento em direitos_creditórios"
-
- Applying Phraser 2x: "direito do consumidor origem : bangu_regional xxix juizado_especial_civel_ação : [processo] - - recte : fundo de investimento em direitos_creditórios"
 
  <a name="3.2"></a>
  ### 3.2\. Word2Vec/Doc2Vec
@@ -226,7 +90,7 @@ the meaning of the various textual elements, based on the contexts in which these
  elements appear. Doc2Vec methods are extensions/modifications of Word2Vec
  for generating whole text representations.
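In practice, the resulting vectors are compared with cosine similarity (this is what Gensim's `most_similar` computes). A minimal, self-contained sketch on made-up 3-d vectors (illustrative values, not taken from our trained models):

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity: 1.0 for parallel vectors, near 0.0 for orthogonal ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d "embeddings" -- real models use 100 to 300 dimensions
emb = {
    "juiz": [0.9, 0.1, 0.0],
    "magistrado": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

print(cosine(emb["juiz"], emb["magistrado"]))  # close to 1: similar contexts
print(cosine(emb["juiz"], emb["banana"]))      # close to 0: unrelated contexts
```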
 
- The Word2Vec and Doc2Vec methods are presented together in this section because they were trained together using the Gensim package. Both models are compatible with the `clean` function, but it is not necessary to use it beforehand. Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html) for more details. Preferably use Gensim version 3.8.3.
 
 
  Below we have a summary table with some important information about the trained models:
@@ -239,8 +103,7 @@ Below we have a summary table with some important information about the trained models:
  | ```w2v_d2v_dbow*``` | Distributed Bag-of-Words (DBOW) | Skip-Gram (SG) | 100, 200, 300 | 15
 
 
-
-
 
  #### Using *Word2Vec*
 
@@ -251,14 +114,14 @@ Installing Gensim
  !pip install gensim=='3.8.3'
  ```
 
- Loading W2V (all the files for the specific model should be in the same folder)
 
 
  ```python
  from gensim.models import KeyedVectors
 
  #Loading a W2V model
- w2v=KeyedVectors.load('models_w2v_d2v/w2v_d2v_dm_size_100_window_15_epochs_20')
  w2v=w2v.wv
  ```
  Viewing the first 10 entries of 'juiz' vector
@@ -307,14 +170,14 @@ Installing Gensim
  !pip install gensim=='3.8.3'
  ```
 
- Loading D2V (all the files for the specific model should be in the same folder)
 
 
  ```python
  from gensim.models import Doc2Vec
 
  #Loading a D2V model
- d2v=Doc2Vec.load('models_w2v_d2v/w2v_d2v_dm_size_100_window_15_epochs_20')
  ```
 
  Inferring vector for a text
@@ -338,109 +201,6 @@ txt_vec[:10]
 
 
 
- <a name="3.3"></a>
- ### 3.3\. FastText
-
- The FastText [4] methods, like Word2Vec, form a class of models for creating vector representations (embeddings) for tokens. Unlike Word2Vec, which disregards the morphology of the tokens and allocates a different vector to each one of them, the FastText methods consider that each token is formed by character n-grams or substrings. In this way, the representation of tokens that do not appear in the training set can be inferred from the representations of their substrings. Also, rare tokens can have more robust representations than those returned by the Word2Vec methods.
-
- The models are compatible with the `clean` function, but it is not necessary to use it beforehand. Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim/models/fasttext.html) for more details. Preferably use Gensim version 4.0.1.
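The subword idea can be seen with character n-grams alone. FastText's defaults pad a word with boundary markers `<`/`>` and take all n-grams for n = 3..6; an out-of-vocabulary token then shares n-grams with known words, and its vector is composed from those shared subword vectors. A minimal sketch of the extraction step only (not the actual model internals):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # FastText-style subwords: pad with boundary markers, then slide windows.
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

known = char_ngrams("juiz")
oov = char_ngrams("juizado")  # pretend this token is out of vocabulary

# The OOV token shares subwords with the known one, so a vector can be
# composed for it from the shared n-gram vectors.
print(sorted(known & oov))
# ['<ju', '<jui', '<juiz', 'jui', 'juiz', 'uiz']
```

This overlap is exactly why the OOV lookup `fast['juizasjashdkjhaskda']` below still returns a (rough) vector instead of failing.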
-
- Below we have a summary table with some important information about the trained models:
-
- | Filenames | FastText | Sizes | Windows
- |:-------------------:|:--------------:|:--------------:|:--------------:|
- | ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15
- | ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15
-
-
- #### Using *FastText*
-
- Installing Gensim
-
- ```python
- !pip install gensim=='4.0.1'
- ```
-
- Loading FastText (all the files for the specific model should be in the same folder)
-
- ```python
- from gensim.models import FastText
-
- #Loading a FastText model
- fast=FastText.load('models_fasttext/fasttext_sg_size_100_window_15_epochs_20')
- fast=fast.wv
- ```
-
- Viewing the first 10 entries of 'juiz' vector
-
- ```python
- fast['juiz'][:10]
- ```
-
- array([ 0.46769685,  0.62529474,  0.08549586,  0.09621219, -0.09998254,
-        -0.07897531,  0.32838237, -0.33229044, -0.05959201, -0.5865443 ],
-       dtype=float32)
-
- Viewing the first 10 vector entries of a token that was not in our vocabulary
-
- ```python
- fast['juizasjashdkjhaskda'][:10]
- ```
-
- array([ 0.02795791,  0.1361525 ,  0.1340836 , -0.36824217, -0.11549155,
-        -0.11167661,  0.32045627, -0.33701468, -0.05198409, -0.05513595],
-       dtype=float32)
-
-
- <a name="3.4"></a>
- ### 3.4\. BERTikal
-
-
- We call BERTikal our BERT-Base model (cased) [5] for Brazilian legal language. BERT models are based on neural network architectures called Transformers. BERT models are trained on large sets of texts using the self-supervised paradigm, which basically consists of solving unsupervised problems with supervised techniques. A pre-trained BERT model is capable of generating representations for entire texts and can be adapted for a supervised task, e.g., text classification or question answering, using the fine-tuning mechanism.
-
- BERTikal was trained using version 4.2.2 of the Python package [Transformers](https://huggingface.co/transformers/), and the checkpoint we make available is compatible with [PyTorch](https://pytorch.org/) 1.9.0. Although we state the versions of both packages, more recent versions can be used in applications of the model, as long as there are no relevant version conflicts.
-
- Our model was trained from the checkpoint made available in [NeuralMind's GitHub repository](https://github.com/neuralmind-ai/portuguese-bert) by the authors of recent research [6].
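A BERT encoder returns one hidden vector per token; a single representation for the whole text is then commonly obtained by pooling these vectors, for example by averaging them (using the [CLS] vector is another common choice). A library-free sketch of the mean-pooling step only, on made-up token vectors (BERT-Base models such as BERTikal actually use hidden size 768):

```python
def mean_pool(token_vectors):
    # Average the per-token vectors into one fixed-size text vector.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Pretend these came from a BERT forward pass (3 tokens, 4-d for brevity)
token_vectors = [
    [0.2, 0.4, 0.0, 1.0],
    [0.4, 0.0, 0.6, 1.0],
    [0.0, 0.2, 0.0, 1.0],
]

text_vector = mean_pool(token_vectors)
print(text_vector)
```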
-
- #### Using *BERTikal*
-
- Installing Torch and Transformers
-
- ```python
- !pip install torch=='1.8.1' transformers=='4.2.2'
- ```
-
- Loading BERT (all the files for the specific model should be in the same folder)
-
- ```python
- from transformers import BertModel, BertTokenizer
-
- bert_tokenizer = BertTokenizer.from_pretrained('model_bertikal/', do_lower_case=False)
- bert_model = BertModel.from_pretrained('model_bertikal/')
- ```
-
  --------------
 
  <a name="4"></a>
  - NLP
  - legal field
  - python
+ - word2vec
+ - doc2vec
  ---
 
 
 
  0. [Accessing the Language Models](#0)
  1. [ Introduction / Installing package](#1)
+ 2. [ Language Models (Details / How to use)](#2)
+     1. [ Word2Vec/Doc2Vec ](#2.1)
+ 3. [ Demonstrations / Tutorials](#3)
+ 4. [ References](#4)
 
  --------------
 
 
  All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).
 
  Please contact *felipemaiapolo@gmail.com* if you have any problem accessing the language models.
 
  --------------
  *LegalNLP* is promising given the scarcity of Natural Language Processing resources focused on the Brazilian legal language. It is worth mentioning that our library was made for Python, one of the most well-known programming languages for machine learning.
 
 
+ You first need to install the `huggingface_hub` library by running the following command on the terminal:
  ```sh
+ $ pip install huggingface_hub
  ```
 
+ Import `hf_hub_download`:
 
  ```python
+ from huggingface_hub import hf_hub_download
  ```
 
+ Then you can download our Word2Vec(SG)/Doc2Vec(DBOW) and Word2Vec(CBOW)/Doc2Vec(DM) models with the following commands:
 
  ```python
+ w2v_sg_d2v_dbow = hf_hub_download(repo_id = "Projeto/LegalNLP", filename = "w2v_d2v_dbow_size_100_window_15_epochs_20")
+ w2v_cbow_d2v_dm = hf_hub_download(repo_id = "Projeto/LegalNLP", filename = "w2v_d2v_dm_size_100_window_15_epochs_20")
  ```
 
+ --------------
 
 
+ <a name="2"></a>
+ ## 2\. Language Models
 
  <a name="3.2"></a>
  ### 3.2\. Word2Vec/Doc2Vec
  elements appear. Doc2Vec methods are extensions/modifications of Word2Vec
  for generating whole text representations.
 
+ Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html) for more details. Preferably use Gensim version 3.8.3.
 
 
  Below we have a summary table with some important information about the trained models:
  | ```w2v_d2v_dbow*``` | Distributed Bag-of-Words (DBOW) | Skip-Gram (SG) | 100, 200, 300 | 15
 
 
+ Here we make available both models with size 100 and window 15.
107
 
108
  #### Using *Word2Vec*
109
 
114
  !pip install gensim=='3.8.3'
115
  ```
116
 
117
+ Loading W2V:
118
 
119
 
120
  ```python
121
  from gensim.models import KeyedVectors
122
 
123
  #Loading a W2V model
124
+ w2v=KeyedVectors.load(w2v_cbow_d2v_dm)
125
  w2v=w2v.wv
126
  ```
127
  Viewing the first 10 entries of 'juiz' vector
170
  !pip install gensim=='3.8.3'
171
  ```
172
 
173
+ Loading D2V
174
 
175
 
176
  ```python
177
  from gensim.models import Doc2Vec
178
 
179
  #Loading a D2V model
180
+ d2v=Doc2Vec.load(w2v_cbow_d2v_dm)
181
  ```
182
 
183
  Inferring vector for a text
201
 
202
 
203
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
204
  --------------
205
 
206
  <a name="4"></a>